Add CPU kernels for some big detectors

David Hammer requested to merge feat/cpu-agipd-dssc into master

tl;dr:

  • Add Cython kernels for AGIPD and DSSC
  • Refactor large parts of correction kernel runner code to make this nice
    • JUNGFRAU - which already had both CPU and GPU kernels - was refactored to match the updated design

How fast is this?

Quick measurements on a Maxwell P100 node (Tesla P100-PCIE GPU, Intel Xeon E5-2640 v4 @ 2.40 GHz). Setup:

  • Full 352 memory cell AGIPD data
  • All constants loaded, all default corrections on unless otherwise specified
  • Tested with either four modules or just one module being corrected (in each case, also running preview assemblers on the same node)
  • No frame filter
  • No axis reordering

Findings:

  • GPU, four modules: easily full speed with around 40 ms per train
    • there's probably some scheduling contention between the processes competing for the GPU; some average around 24 ms, others fluctuate up to around 50 ms
    • nvidia-smi reports around 70 % volatile GPU util
    • uses around 8 of the 16 GiB GPU memory
  • GPU, one module: around 30 ms per train
  • CPU, four modules: not full speed, broadly 80-130 ms per train
    • but they all fluctuate; I don't think they ever all hit the low end on the same train
    • rates from around 5.6 Hz to 6.7 Hz
    • preview matching ratio (keep in mind: four modules) meandering between 15 and 30 %
    • all 40 cores running at around 70 % according to htop
  • CPU, four modules, all corrections off: easily full speed with around 40 ms per train
  • CPU, one module: easily full speed at around 42 ms per train
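For context on the numbers above: the per-train figures are simple wall-clock times. A back-of-envelope check can be done with a wrapper like this (illustrative only; the real pipeline reports its own rates, and `process_train` is a stand-in for one correction pass):

```python
import time


def time_per_train(process_train, trains, warmup=5):
    """Return mean wall-clock milliseconds per train, skipping warmup trains.

    The first few trains are excluded so one-off costs (buffer allocation,
    cache warming) don't skew the average.
    """
    for t in trains[:warmup]:
        process_train(t)
    start = time.perf_counter()
    for t in trains[warmup:]:
        process_train(t)
    elapsed = time.perf_counter() - start
    return 1000 * elapsed / max(1, len(trains) - warmup)
```

At a 10 Hz train rate, anything consistently under 100 ms per train keeps up, which is why ~40 ms counts as "easily full speed" while 80-130 ms does not.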

Why is the diff so big?

The JUNGFRAU CPU / GPU runner implementation involved a lot of copy-paste. I figured I'd DRY this up a bit* as a prototype for how to structure AGIPD and DSSC. The proposed redesign in this MR gets rid of BaseGpuRunner (a subclass of BaseKernelRunner); with CuPy, the CPU/GPU differences are pretty small. Instead, when multiple runners are needed for one detector type (e.g. JUNGFRAU), one common runner class inheriting from BaseKernelRunner (e.g. JungfrauBaseRunner) sets up the constant buffers and such, and device-specific subclasses (e.g. JungfrauGpuRunner and JungfrauCpuRunner) implement the parts that differ.
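A rough sketch of that hierarchy, assuming offset subtraction as the example correction. Only the class names `BaseKernelRunner`, `JungfrauBaseRunner`, `JungfrauGpuRunner`, and `JungfrauCpuRunner` come from this MR; every method shown is illustrative:

```python
import numpy as np


class BaseKernelRunner:
    """Detector-independent driver: move data in, run the correction kernel."""

    def correct(self, raw):
        data = self._to_device(raw)
        return self._run_kernel(data)


class JungfrauBaseRunner(BaseKernelRunner):
    """Shared JUNGFRAU setup: constant buffers allocated via a subclass hook."""

    def __init__(self, shape):
        # e.g. offset constants; _alloc decides CPU vs GPU memory
        self.offset_map = self._alloc(shape)


class JungfrauCpuRunner(JungfrauBaseRunner):
    def _alloc(self, shape):
        return np.zeros(shape, dtype=np.float32)

    def _to_device(self, raw):
        return np.asarray(raw, dtype=np.float32)

    def _run_kernel(self, data):
        # stand-in for the Cython kernel: plain offset subtraction
        return data - self.offset_map


class JungfrauGpuRunner(JungfrauBaseRunner):
    def _alloc(self, shape):
        import cupy  # lazy import: the CPU path has no GPU dependency

        return cupy.zeros(shape, dtype=cupy.float32)

    def _to_device(self, raw):
        import cupy

        return cupy.asarray(raw, dtype=cupy.float32)

    def _run_kernel(self, data):
        # with CuPy this is often the exact same expression as the CPU version
        return data - self.offset_map
```

The point of the hook methods is that the shared base class never needs to know which device it is on.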

[image: old-jf-runner]

A few bits in the BaseKernelRunner need to take CPU vs GPU into account. I suppose we could get rid of that by having a GPU subclass or by providing overrides via a mixin class, but I don't think the complexity here justifies multiple inheritance.
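One lightweight way to keep those CPU/GPU-aware bits in the base class without a mixin is the common array-module pattern. This is a sketch, not the MR's implementation; the `_xp` attribute and method names are my own:

```python
import numpy as np


class BaseKernelRunner:
    # Subclasses point _xp at numpy or cupy; base-class helpers then run on
    # whichever device the buffers live on.
    _xp = np

    def preview_mean(self, data):
        # e.g. averaging frames for a preview is the same call on CPU and GPU
        return self._xp.mean(data, axis=0)

    def to_host(self, arr):
        # CuPy arrays need an explicit .get() to come back to host memory;
        # NumPy arrays are already there
        return arr if self._xp is np else arr.get()
```

A GPU subclass would just set `_xp = cupy` in its `__init__`, leaving the shared helpers untouched.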

*As part of this DRYing, I tried making the interface between the device and the kernel runner more consistent.

How has this been tested?

I instantiated my development testing pipelines and they behave normally; speed measurements are summarized above. I'm planning to add an output comparison between the GPU and CPU versions to the (currently flimsy) test set.

Edited by David Hammer