
Enable GPU acceleration of the BOZ correction determination

David Hammer requested to merge boz-enable-gpu into master

Overview

In boz.py, I've added a method, use_gpu(), to the parameters object. It can be called after data has been loaded and moves the full dataset to the GPU, managed by cupy (and wrapped again with dask). Throughout the code I had to sprinkle checks for whether to go to the GPU or not, plus some ensure_on_host calls to move less critical or memory-intensive parts back to main memory. The result is that, if you have cupy installed and are running on a node with an Nvidia GPU with at least 20-something GB of memory, adding the following cell to the 1.a notebook makes things pretty fast:

```python
dask.config.set(scheduler="single-threaded")
params.use_gpu()
```

Note that running single-threaded isn't strictly necessary, but it made memory behaviour easier to reason about, and since the heavy parts are offloaded to the GPU anyway, dask's threading doesn't buy much here. Tweaking threading and chunking might still be worthwhile.
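For readers unfamiliar with the pattern, here is a minimal sketch of what offloading dask-wrapped arrays to the GPU and back can look like. The helpers to_gpu and ensure_on_host below are hypothetical illustrations, not the actual boz.py implementation; the real code attaches this to the parameters object.

```python
# Minimal sketch of GPU offloading for dask-wrapped arrays.
# NOTE: to_gpu / ensure_on_host are hypothetical helpers for illustration;
# they are not the actual boz.py implementation.
import numpy as np
import dask.array as da

try:
    import cupy as cp
    HAVE_GPU = True
except ImportError:
    cp = None
    HAVE_GPU = False

def to_gpu(arr):
    # Re-back each dask chunk with a cupy array so subsequent
    # operations run on the GPU.
    return arr.map_blocks(cp.asarray) if HAVE_GPU else arr

def ensure_on_host(arr):
    # Bring chunks back to main memory, e.g. for less critical or
    # memory-intensive steps, or before saving results.
    return arr.map_blocks(cp.asnumpy) if HAVE_GPU else arr

data = da.from_array(np.arange(16.0).reshape(4, 4), chunks=2)
host_result = ensure_on_host(to_gpu(data) ** 2)
print(type(host_result.compute()))  # numpy.ndarray either way
```

Because dask is agnostic about the chunk backend, the same task graph works with numpy or cupy chunks; only the transfer points need to be explicit.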

Benchmarks

I ran through the notebook with the default paths as found in the repository (with the output path changed to somewhere in my scratch space, of course):

```python
proposal = 2937
darkrun = 615
run = 614
module = 15
gain = 3
sat_level = 500
rois_th = 4
ff_prod_th = 350
ff_ratio_th = 0.75
```

On GPU

Test run on max-exflg006. This node has two Intel Xeon CPU E5-2640 v4 @ 2.40GHz and, more importantly, a Tesla V100 GPU (PCIe model with 32GB memory).

First run through after getting the types lined up and memory usage under control:

  • FF fit: Wall time: 5min 39s
  • NL fit: Wall time: 28min 46s

Re-ran to fix some rough edges (param saving):

  • FF fit: Wall time: 5min 39s
  • NL fit: Wall time: 28min 39s

Total notebook execution time for the re-run was approximately 36 minutes. I assume the GPFS cache was hot by then, since loading the data into memory was pretty fast.

On CPU

Test run on max-exfl260. This node has two Intel Xeon Gold 6240 CPU @ 2.60GHz.

  • FF fit: Wall time: 1h 14min 51s
  • NL fit: not done yet, iteration 16 (of max 25) at 3h 17min 37s

Conclusion

I started the regular CPU version of the notebook before re-implementing the relevant changes to enable GPU acceleration. That means roughly an hour of coding, plus a bunch of GPU-version restarts for trial-and-error debugging (keeping track of what goes on the GPU and of memory consumption), passed before the GPU version ran through cleanly. I then ran the GPU version again for a full timed run, and painstakingly committed only the relevant bits of boz.py and the notebook to a branch for this MR. After all that, the CPU test still wasn't done, so the difference is pretty drastic.

@carinanc @lleguy
