Enable GPU acceleration of the BOZ correction determination
Overview
In boz.py, I've added a use_gpu() method to the parameters object. It can be called after the data has been loaded and moves the full data to the GPU, managed by cupy (and wrapped again with dask).
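For reference, the general pattern for getting a dask-wrapped array onto the GPU looks roughly like the sketch below; the actual code in boz.py may organise this differently, and the array shapes and names here are made up.

```python
# Minimal sketch (not the exact boz.py implementation): convert each chunk
# of a numpy-backed dask array to cupy, so the graph runs on the GPU.
import cupy as cp
import dask.array as da
import numpy as np

arr = da.from_array(np.zeros((64, 128, 512), dtype=np.uint16),
                    chunks=(8, 128, 512))

# map_blocks(cp.asarray) swaps the chunk type; dask keeps orchestrating
arr_gpu = arr.map_blocks(cp.asarray)

# reductions now run on the GPU and the result is a cupy array
mean_frame = arr_gpu.mean(axis=0).compute()
print(type(mean_frame))  # <class 'cupy.ndarray'>
```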
Throughout the code I had to sprinkle some checks for whether to go to the GPU or not, plus some ensure_on_host calls to move the less critical or memory-intensive parts back to main memory.
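The helper itself is simple in spirit; something along these lines, although the exact signature in boz.py may differ:

```python
# Hypothetical shape of an ensure_on_host-style helper: hand back a numpy
# array, copying from the GPU only if the input is a cupy array.
import numpy as np

try:
    import cupy as cp
    _can_use_gpu = True
except ImportError:
    _can_use_gpu = False

def ensure_on_host(data):
    """Return data as a numpy array in main memory."""
    if _can_use_gpu and isinstance(data, cp.ndarray):
        return cp.asnumpy(data)
    return np.asarray(data)
```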
The result is that, if you have cupy installed and are running on a node with an Nvidia GPU with at least 20-something GB of memory, then adding the following cell to the 1.a notebook makes things pretty fast:

```python
dask.config.set(scheduler="single-threaded")
params.use_gpu()
```
Note that it's not strictly necessary to run single-threaded, but it made it easier to reason about memory behaviour, and since the heavy parts are offloaded to the GPU anyway, threading dask doesn't do much. Tweaking the threading and chunking might still be worthwhile; a rough idea is sketched below.
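If someone wants to try, the knobs would be along these lines (untested here, and the attribute name and chunk size are just placeholders):

```python
# Untested idea for the threading/chunking experiment: go back to the
# threaded scheduler and use coarser chunks, so fewer but larger blocks
# are shipped to the GPU at a time.
import dask

dask.config.set(scheduler="threads")

# assuming params.arr is the dask array holding the loaded data
# (the attribute name is a guess); rechunk along the train axis:
# params.arr = params.arr.rechunk({0: 32})
```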
Benchmarks
I just ran through the notebook with the default paths as found in the repository (output path changed to something in my scratch, of course):
```python
proposal = 2937
darkrun = 615
run = 614
module = 15
gain = 3
sat_level = 500
rois_th = 4
ff_prod_th = 350
ff_ratio_th = 0.75
```
On GPU
Test run on max-exflg006. This node has two Intel Xeon CPU E5-2640 v4 @ 2.40GHz and, more importantly, a Tesla V100 GPU (PCIe model with 32 GB memory).
First run through after getting the types lined up and memory usage under control:
- FF fit: Wall time: 5min 39s
- NL fit: Wall time: 28min 46s
Re-ran to fix some rough edges (param saving):
- FF fit: Wall time: 5min 39s
- NL fit: Wall time: 28min 39s
Total notebook execution time for the re-run was approximately 36 minutes. I guess by this time the GPFS cache was hot, since loading the data into memory was pretty fast.
On CPU
Test run on max-exfl260. This node has two Intel Xeon Gold 6240 CPUs @ 2.60GHz.
- FF fit: Wall time: 1h 14min 51s
- NL fit: not done yet, iteration 16 (of max 25) at 3h 17min 37s
Conclusion
I started running the regular CPU version of the notebook before I began re-implementing the relevant changes for GPU acceleration. That means roughly an hour of coding, plus a bunch of restarts of the GPU version for trial-and-error debugging (keeping track of what goes to the GPU and of memory consumption), passed before the GPU version ran through cleanly. I then ran the GPU version again for a full timed run-through, and afterwards painstakingly committed only the relevant bits of boz.py and the notebook to a branch on the repository for creating this MR. After all of this, the CPU test was still not done, so the difference is pretty drastic: on the FF fit alone, 5min 39s versus 1h 14min 51s is roughly a 13x speedup.
Activity
requested review from @carinanc
assigned to @hammerd
```python
def use_gpu(self):
    assert _can_use_gpu, 'Failed to import cupy'
    gpu_mem_gb = cp.cuda.Device().mem_info[1] / 2**30
    if gpu_mem_gb < 30:
        print(f'Warning: GPU memory ({gpu_mem_gb}GB) may be insufficient')
    if self._using_gpu:
        return
    assert (
        self.arr is not None and
        self.arr_dark is not None
    ), "Must load data before switching to GPU"
    if self.mask is not None:
        self.mask = cp.array(self.mask)
    # moving full data to GPU
    limit = 2**30
```

I just put in this warning in case someone tries to run this on one of the smaller GPUs, say a P100 with "only" 16 GB of memory. In that case, CuPy will probably end up throwing an OutOfMemoryError. When testing, even though the full data is < 6 GB and the dark data is < 2 GB, some of the computations (probably buffers managed by CuPy, or by dask and then via CuPy) cause memory consumption to balloon to 20-something GB. To be fair, the input sizes I mention are while the data is still uint16, so that's not actually outrageous (especially as the original uint16 version is kept around for the LUT parts). I don't think we have any 26 GB GPUs, so I'd just recommend going for a 32 GB one like the V100 I was testing on.
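If someone does need to squeeze this onto a 16 GB card, one thing that might be worth trying (I haven't) is capping CuPy's memory pool and releasing cached blocks between the heavy steps:

```python
# Untested: bound CuPy's cached allocations so intermediate buffers can't
# balloon past what a smaller card offers.
import cupy as cp

pool = cp.get_default_memory_pool()
pool.set_limit(size=14 * 2**30)  # leave headroom below a physical 16 GB

# ... run one of the fits ...

pool.free_all_blocks()  # hand cached (but unused) blocks back to the driver
print(f"{pool.used_bytes() / 2**30:.1f} GB still held by live arrays")
```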
mentioned in commit e9a355ce