Add CPU kernels for some big detectors
tl;dr:
- Add Cython kernels for AGIPD and DSSC
- Refactor large parts of correction kernel runner code to make this nice
- JUNGFRAU (which already had both) refactored according to the updated design
How fast is this?
Quick measurements on a Maxwell P100 node (Tesla P100-PCIE GPU and Intel Xeon E5-2640 v4 @ 2.40 GHz). Setup:
- Full 352 memory cell AGIPD data
- All constants loaded, all default corrections on unless otherwise specified
- Tested with four modules or just one module being corrected (in each case, also running preview assemblers on the same node)
- No frame filter
- No axis reordering
Findings:
- GPU, four modules: easily full speed at around 40 ms per train
  - there's probably some ordering going on with the processes competing for the GPU; some get around 24 ms, others fluctuate up to around 50 ms
  - `nvidia-smi` reports around 70 % volatile GPU utilization; around 8 of the 16 GiB of GPU memory is used
- GPU, one module: around 30 ms per train
- CPU, four modules: not full speed, broadly 80-130 ms per train
  - all four fluctuate; I don't think they ever all hit the low end on the same train
  - rates range from around 5.6 Hz to 6.7 Hz
  - preview matching ratio (keep in mind: four modules) meandering between 15 and 30 %
  - all 40 cores running hot at around 70 % in `htop`
- CPU, four modules, all corrections off: easily full speed at around 40 ms per train
- CPU, one module: easily full speed at around 42 ms per train
Why is the diff so big?
The JUNGFRAU CPU / GPU runner implementation was a lot of copy-paste.
I figured I'd DRY this a bit* as a prototype for how to make it better for AGIPD and DSSC.
The proposed redesign in this MR gets rid of `BaseGpuRunner` (subclass of `BaseKernelRunner`); with CuPy, the differences between CPU and GPU code are pretty small. Instead, when multiple runners are needed for one detector type (e.g. JUNGFRAU), one common runner class inheriting from `BaseKernelRunner` (e.g. `JungfrauBaseRunner`) sets up the constant buffers and such, and device-specific subclasses (e.g. `JungfrauGpuRunner` and `JungfrauCpuRunner`) implement the actual kernel calls.
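A minimal sketch of what that hierarchy could look like; the class names `BaseKernelRunner`, `JungfrauBaseRunner`, `JungfrauGpuRunner`, and `JungfrauCpuRunner` are from this MR, but all method names, constant names, and buffer shapes below are made up for illustration:

```python
import numpy as np

try:
    import cupy
except ImportError:  # CPU-only node; GPU runner just won't be usable
    cupy = None


class BaseKernelRunner:
    """Interface the correction device sees, regardless of CPU or GPU backend."""

    _xp = np  # array module used for buffers; GPU runners swap in cupy

    def load_constant(self, constant_name, constant_data):
        raise NotImplementedError

    def correct(self, image_data, cell_table):
        raise NotImplementedError


class JungfrauBaseRunner(BaseKernelRunner):
    """Detector-specific but device-agnostic: buffer shapes, constant handling."""

    def __init__(self, memory_cells, pixels_y, pixels_x):
        # allocating via self._xp means the same code produces host or device
        # buffers depending on which subclass is instantiated
        shape = (memory_cells, pixels_y, pixels_x)
        self.offset_map = self._xp.zeros(shape, dtype=np.float32)
        self.relgain_map = self._xp.ones(shape, dtype=np.float32)

    def load_constant(self, constant_name, constant_data):
        if constant_name == "Offset":
            self.offset_map[:] = self._xp.asarray(constant_data, dtype=np.float32)
        elif constant_name == "RelativeGain":
            self.relgain_map[:] = self._xp.asarray(constant_data, dtype=np.float32)


class JungfrauCpuRunner(JungfrauBaseRunner):
    def correct(self, image_data, cell_table):
        # the real implementation would call into the Cython kernel here
        ...


class JungfrauGpuRunner(JungfrauBaseRunner):
    _xp = cupy

    def correct(self, image_data, cell_table):
        # the real implementation would launch the CuPy kernel here
        ...
```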
A few bits in `BaseKernelRunner` still need to take CPU vs GPU into account.
I suppose we could get rid of that by having a GPU subclass or by providing overrides via a mixin class, but I don't think the complexity here justifies multiple inheritance.
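The device-awareness in question is small; roughly this kind of thing (just a sketch, and the function name is made up - in practice it would be a helper on `BaseKernelRunner`):

```python
def transfer_to_host(buffer):
    # one of the few CPU vs GPU distinctions left in the shared code: results
    # computed on the GPU live in device memory and have to be copied back to
    # the host before being sent out; cupy arrays expose .get() for this,
    # while numpy arrays can be passed through unchanged
    if hasattr(buffer, "get"):
        return buffer.get()
    return buffer
```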
*As part of this DRYing, I tried making the interface between the device and the kernel runner more consistent.
How has this been tested?
I instantiated my development testing pipelines and they behave normally; the speed estimates above come from those runs. I'm planning to add an output comparison between the GPU and CPU versions to the (currently flimsy) test set.
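That comparison could look roughly like the following sketch; `make_runner` is a placeholder for however the real runners are instantiated, and the data shape and tolerances are illustrative:

```python
import numpy as np
import pytest

cupy = pytest.importorskip("cupy")  # skip on nodes without a usable GPU


def test_cpu_and_gpu_corrections_agree():
    rng = np.random.default_rng(seed=0)
    # illustrative AGIPD-like dimensions: 352 memory cells of 512 x 128 pixels
    raw_data = rng.integers(0, 2**14, size=(352, 512, 128), dtype=np.uint16)
    cell_table = np.arange(352, dtype=np.uint16)

    cpu_runner = make_runner(backend="cpu")  # placeholder factory
    gpu_runner = make_runner(backend="gpu")  # placeholder factory
    corrected_cpu = cpu_runner.correct(raw_data, cell_table)
    corrected_gpu = cupy.asnumpy(gpu_runner.correct(raw_data, cell_table))

    # allow small floating-point differences between the Cython and CuPy kernels
    np.testing.assert_allclose(corrected_cpu, corrected_gpu, rtol=1e-5, atol=1e-3)
```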