Add common mode correction step to AGIPD
Egor's common mode correction addon is getting popular, so I figured we could put the algorithm in as a correction step for AGIPD and spend some time making a dedicated fast kernel.
The implementation from the addon is something like (please ignore the large noise_peak_range, I was just testing with generated data):

import cupy as cp


def common_mode_correction(
    data, num_iter=4, min_dark_fraction=0.15, noise_peak_range=1500
):
    n_cells, n_x, n_y = data.shape
    # view as (cell, ASIC row, 64, ASIC column, 64); each ASIC is 64 x 64 pixels
    per_asic_data = data.reshape(n_cells, n_x // 64, 64, n_y // 64, 64)
    min_dark_pixels = 4096 * min_dark_fraction
    dark_pixels = cp.ones_like(per_asic_data)
    for i in range(num_iter):
        dark_pixels[:] = per_asic_data
        # pixels outside the noise peak do not count towards the baseline
        dark_pixels[cp.abs(per_asic_data) > noise_peak_range] = cp.nan
        num_dark_pixels = cp.sum(cp.isfinite(dark_pixels), axis=(2, 4), keepdims=True)
        baseline = cp.nansum(dark_pixels, axis=(2, 4), keepdims=True) / num_dark_pixels
        # leave ASICs with too few dark pixels uncorrected
        baseline[num_dark_pixels < min_dark_pixels] = .0
        per_asic_data -= baseline
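
For checking the algorithm without a GPU, the same loop can be ported to NumPy. This is only a minimal sketch, assuming float data whose dimensions are multiples of 64. The name `common_mode_correction_np`, the masked-sum formulation (instead of NaN plus `nansum`) and all synthetic test values (noise sigma, offset range, `noise_peak_range=35`) are made up for illustration and do not come from the addon.

```python
import numpy as np


def common_mode_correction_np(
    data, num_iter=4, min_dark_fraction=0.15, noise_peak_range=1500
):
    """Hypothetical CPU port; corrects `data` in place and returns it."""
    n_cells, n_x, n_y = data.shape
    # Splitting axes gives a view, so writes to per_asic end up in data.
    per_asic = data.reshape(n_cells, n_x // 64, 64, n_y // 64, 64)
    min_dark_pixels = 64 * 64 * min_dark_fraction
    for _ in range(num_iter):
        # Dark pixels: finite values inside the noise peak.
        dark = np.abs(per_asic) <= noise_peak_range
        num_dark = dark.sum(axis=(2, 4), keepdims=True)
        total = np.where(dark, per_asic, 0).sum(axis=(2, 4), keepdims=True)
        # ASICs with too few dark pixels get a zero baseline (left uncorrected).
        baseline = np.where(
            num_dark >= min_dark_pixels, total / np.maximum(num_dark, 1), 0
        )
        per_asic -= baseline
    return data


def asic_means(frames):
    """Mean value per (cell, ASIC row, ASIC column)."""
    n_cells, n_x, n_y = frames.shape
    return frames.reshape(n_cells, n_x // 64, 64, n_y // 64, 64).mean(axis=(2, 4))


# Synthetic data: Gaussian noise plus a random offset per ASIC (assumed values).
rng = np.random.default_rng(0)
frames = rng.normal(0.0, 5.0, size=(3, 512, 128)).astype(np.float32)
offsets = rng.uniform(-20, 20, size=(3, 8, 1, 2, 1)).astype(np.float32)
view = frames.reshape(3, 8, 64, 2, 64)
view += offsets

corrected = common_mode_correction_np(frames.copy(), noise_peak_range=35.0)
before = np.abs(asic_means(frames)).max()
after = np.abs(asic_means(corrected)).max()
print(f"largest |ASIC mean| before: {before:.2f}, after: {after:.5f}")
```

With these generated frames there is no signal outside the noise peak, so every ASIC baseline should end up close to zero. Real data will behave differently where an ASIC has few dark pixels.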
To avoid the overhead of multiple CuPy function calls plus the obvious data allocation / copying, I tried my hand at a single custom CUDA kernel to do the same thing. The fastest version I have come up with so far is in this MR; it works something like this:
Testing on a node with a P100 GPU, the reference implementation takes on average 18.88 ms whereas the custom kernel takes on average 2.21 ms for 352 frames. I do want to try a few more variants of the kernel, but it's unclear if that's worthwhile. I think this one already enjoys memory access coalescing, so I'm not sure how much more there is to squeeze out. Napkin math: if each of the four iterations just reads and writes the image data array once, we're already at 45.62 % of the P100's supposed 732.2 GB/s memory bandwidth.
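
The napkin math can be spelled out as below. This is a sketch under two assumptions that are not stated above: the data is float32 (4 bytes per pixel) and each frame is a single 512 x 128 AGIPD module. With those inputs the quoted percentage comes out.

```python
n_frames = 352
pixels_per_frame = 512 * 128   # assumption: one module, 8 x 2 ASICs of 64 x 64
bytes_per_pixel = 4            # assumption: float32 image data
num_iter = 4
kernel_time_s = 2.21e-3
reference_time_s = 18.88e-3
peak_bandwidth = 732.2e9       # bytes/s, the P100 figure quoted above

array_bytes = n_frames * pixels_per_frame * bytes_per_pixel
# One full read plus one full write of the array per iteration.
traffic_bytes = array_bytes * 2 * num_iter
achieved = traffic_bytes / kernel_time_s
fraction = achieved / peak_bandwidth
speedup = reference_time_s / kernel_time_s

print(
    f"{achieved / 1e9:.1f} GB/s, {fraction:.2%} of peak, "
    f"{speedup:.1f}x faster than the reference"
)
# prints: 334.0 GB/s, 45.62% of peak, 8.5x faster than the reference
```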
In testing the result is very close to the reference implementation. Kernel not currently doing anything to improve numerical stability - am definitely open to adding that.
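
One possible option for the numerical stability part is compensated (Kahan) summation when accumulating the per-ASIC sum in float32. The sketch below is plain Python on the CPU and is not what the kernel currently does. The helper names and the test values (4096 pixels sitting on a pedestal of 1000) are made up for illustration. It compares a naive sequential float32 sum and a Kahan sum against a float64 sum of the same values.

```python
import numpy as np


def naive_sum_f32(values):
    """Sequential accumulation with a single float32 accumulator."""
    acc = np.float32(0.0)
    for v in values:
        acc = np.float32(acc + v)
    return acc


def kahan_sum_f32(values):
    """Kahan summation: carry the rounding error of each addition along."""
    acc = np.float32(0.0)
    comp = np.float32(0.0)  # running compensation for lost low-order bits
    for v in values:
        y = np.float32(v - comp)
        t = np.float32(acc + y)
        comp = np.float32((t - acc) - y)
        acc = t
    return acc


# Assumed test data: one ASIC's worth of pixels with a large common offset.
rng = np.random.default_rng(0)
asic = (rng.normal(0.0, 5.0, size=64 * 64) + 1000.0).astype(np.float32)

exact = asic.astype(np.float64).sum()
naive_err = abs(float(naive_sum_f32(asic)) - exact)
kahan_err = abs(float(kahan_sum_f32(asic)) - exact)
print(f"naive float32 error: {naive_err:.4f}, Kahan float32 error: {kahan_err:.4f}")
```

A tree-shaped parallel reduction, as is common in CUDA kernels, already behaves better than the naive sequential loop here, so whether compensation is worth the extra arithmetic would need to be measured on real data.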
Effect on some arbitrary AGIPD data (excuse the downsampling, SSH is slow). Bad pixel masking turned off because whatever constants I got were masking everything. Still, intended effect is shown; these ASICs clearly have different baselines going on:
And here most are brought in line:
If you look closely, you can maybe tell from the performance counters that I turned on the common mode correction around 17:24. Still, easily fast enough for 10 Hz even on this P100.