
Enable GPU acceleration of the BOZ correction determination

Merged David Hammer requested to merge boz-enable-gpu into master
2 unresolved threads

Overview

In boz.py, I've added a method, use_gpu(), to the parameters object. It can be called after the data has been loaded and moves the full data to the GPU, managed by cupy (and wrapped again with dask). Throughout the code, I had to sprinkle in checks for whether to go on the GPU or not, plus some ensure_on_host calls to move less critical or memory-intensive parts back to main memory. The result is that, if you have cupy installed and are running on a node with an Nvidia GPU with at least 20-something GB of memory, then adding the following cell to the 1.a notebook will make things pretty fast:

dask.config.set(scheduler="single-threaded")
params.use_gpu()

Note that it's not strictly necessary to run single-threaded, but it made memory behaviour easier to reason about, and since the heavy parts are offloaded to the GPU anyway, dask's threading doesn't gain much. Tweaking threading and chunking might still be worthwhile.
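For context, the general pattern behind this (a minimal sketch, not the actual code of this MR; the array shape and chunking are made up) is to rewrap the NumPy-backed dask chunks as CuPy arrays and to bring reduced results back to the host explicitly:

import cupy as cp
import dask.array as da
import numpy as np

# stand-in for detector data: uint16, chunked along the first axis
arr = da.from_array(
    np.zeros((100, 16, 512, 128), dtype=np.uint16),
    chunks=(10, 16, 512, 128),
)

arr_gpu = arr.map_blocks(cp.asarray)                # each chunk becomes a cupy.ndarray
mean_gpu = arr_gpu.astype(np.float32).mean(axis=0)  # evaluated on the GPU at compute() time

# counterpart of ensure_on_host: copy the (small) result back to main memory
mean_cpu = mean_gpu.map_blocks(cp.asnumpy).compute()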

Benchmarks

I just ran through the notebook with the default parameters and paths as found in the repository (with the output path changed to somewhere in my scratch space, of course):

proposal = 2937
darkrun = 615
run = 614
module = 15
gain = 3
sat_level = 500
rois_th = 4
ff_prod_th = 350
ff_ratio_th = 0.75

On GPU

Test run on max-exflg006. This node has two Intel Xeon E5-2640 v4 CPUs @ 2.40 GHz and, more importantly, a Tesla V100 GPU (the PCIe model with 32 GB of memory).

First run through after getting the types lined up and memory usage under control:

  • FF fit: Wall time: 5min 39s
  • NL fit: Wall time: 28min 46s

Re-ran to fix some rough edges (param saving):

  • FF fit: Wall time: 5min 39s
  • NL fit: Wall time: 28min 39s

Total notebook execution time for the re-run was approximately 36 minutes. I guess by this time the GPFS cache was hot (loading data into memory was pretty fast).

On CPU

Test run on max-exfl260. This node has two Intel Xeon Gold 6240 CPUs @ 2.60 GHz.

  • FF fit: Wall time: 1h 14min 51s
  • NL fit: not finished yet; at iteration 16 (of max 25) after a wall time of 3h 17min 37s

Conclusion

I started the regular CPU version of the notebook before beginning to re-implement the relevant changes to enable GPU acceleration. That means an hour of coding plus a bunch of restarts of the GPU version for trial-and-error debugging (keeping track of what goes on the GPU and of memory consumption) passed before the GPU version ran through cleanly. I then ran the GPU version again for a full timed run-through, and then painstakingly committed only the relevant bits of boz.py and the notebook to a branch on the repository to create this MR. After all this the CPU test was still not done, so the difference is pretty drastic.

@carinanc @lleguy

Activity

def use_gpu(self):
    assert _can_use_gpu, 'Failed to import cupy'
    gpu_mem_gb = cp.cuda.Device().mem_info[1] / 2**30
    if gpu_mem_gb < 30:
        print(f'Warning: GPU memory ({gpu_mem_gb}GB) may be insufficient')
    if self._using_gpu:
        return
    assert (
        self.arr is not None and
        self.arr_dark is not None
    ), "Must load data before switching to GPU"
    if self.mask is not None:
        self.mask = cp.array(self.mask)
    # moving full data to GPU
    limit = 2**30
  • Curious question: what is the significance of this limit?

  • Author Developer

    I just put in this warning in case someone tries to run this on one of the smaller GPUs, say a P100 with "only" 16 GB of memory; in that case, CuPy will probably end up throwing an OutOfMemoryError. When testing, even though the full data is < 6 GB and the dark data is < 2 GB, some of the computations (probably buffers managed by CuPy, or even by dask and then via CuPy) caused memory consumption to balloon to 20-something GB. To be fair, the input sizes I mention are for the data while it is still uint16, so that's not actually outrageous (especially as the original uint16 version is kept around for the LUT parts).

    I don't think we have any 26 GB GPUs, so I'd just recommend going for a 32 GB one like the V100 I was testing on.
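    In case it helps, the memory the GPU on a given node offers can be checked up front with cupy's device query (just a sketch, not part of this MR):

    import cupy as cp

    free_b, total_b = cp.cuda.Device().mem_info   # (free, total) device memory in bytes
    print(f"GPU memory: {free_b / 2**30:.1f} GB free of {total_b / 2**30:.1f} GB")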

    • Thanks for working on this, David! I've played around with it and I can confirm that there's a big improvement in computation time and that the produced parameters are (almost) identical to those from the old implementation.

      Code-wise LGTM. Maybe we can test it further this week before merging?

    • Author Developer

      Sure thing! I guess testing would mean re-running previously analyzed runs with the exact same parameters and GPU acceleration to check that the output is consistent (e.g. a comparison like the sketch after this thread)? Or, I suppose, just running with and without the GPU - when the GPU is not used, this MR should not change the behavior.

    • Ah, what I meant is using the code in production and checking how it scales with (potential) user/notebook changes. I believe that everything will work out well, as we have already tested it with a couple of previous runs, but it would be nice to have a successful beamtime as a banner :stuck_out_tongue:
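      For reference, a consistency check along the lines discussed above could be as small as this sketch (the .npz file names and tolerances are hypothetical; it assumes both runs dump their fitted parameters as arrays):

      import numpy as np

      # hypothetical parameter dumps from a CPU run and a GPU run with identical settings
      cpu = np.load("parameters_cpu.npz")
      gpu = np.load("parameters_gpu.npz")

      for key in cpu.files:
          # allow small numerical differences from doing the heavy math on the GPU
          np.testing.assert_allclose(gpu[key], cpu[key], rtol=1e-5, atol=1e-8)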

  • This worked nicely last beamtime, thanks for your hard work! I'll merge this now :smile_cat:

  • mentioned in commit e9a355ce
