
[AGIPD][darks] speed up darks processing

David Hammer requested to merge feat/agipd-darks-speedup into master

Description

A bunch of things in the AGIPD darks notebook could be faster. The proposed changes, ordered roughly from most to least impactful:

  1. Computation in characterize_module can be parallelized
     • With something like pasha, this is a small code change with a very large impact
  2. When retrieving old constants, we don't need to save them to the output folder (do we?); we already have them in memory for the comparison
  3. When retrieving old constants for comparison, we can do some of the other plotting while waiting for the DB (see the sketch below)
  4. Computing the ADU statistics tables at the end can also be parallelized
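
For item 3, the overlap could look roughly like the sketch below. This is not the notebook's code: both helper functions are hypothetical stand-ins, and the sleeps only simulate the slow calibration-database query and the plotting work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def retrieve_old_constants():
    """Stand-in for the slow calibration-database query (hypothetical)."""
    time.sleep(5)
    return {"Offset": "...", "Noise": "..."}


def plot_preliminary_figures():
    """Stand-in for plotting that does not depend on the old constants."""
    time.sleep(3)


with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(retrieve_old_constants)
    plot_preliminary_figures()       # runs while the DB query is in flight
    old_constants = future.result()  # block only once the comparison needs them
```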

How has this been tested?

See the long description with gratuitous plots further down. To check that the output is not borked, I compared it with the output from the master branch when running on some recent data:

xfel-calibrate AGIPD DARK \
	--slurm-name "${PROJECT}" \
	--in-folder "/gpfs/exfel/exp/SPB/202130/p900188/raw" \
	--out-folder "${TESTROOT}/${PROJECT}/batch/${TESTNAME}-${RUNNAME}-${TIMESTAMP}-data" \
	--report-to "${TESTROOT}/${PROJECT}/batch/${TESTNAME}-${RUNNAME}-${TIMESTAMP}-report" \
	--run-high 338 \
	--run-med 339 \
	--run-low 340 \
	--karabo-id "SPB_DET_AGIPD1M-1" \
	--karabo-id-control "SPB_IRU_AGIPD1M1" \
	--karabo-da-control "AGIPD1MCTRL00" \
	--h5path-ctrl "/CONTROL/{}/MDL/FPGA_COMP" \
	--local-output \
	--no-db-output

Judging by h5diff, nothing changed on the output side of things.

Types of changes

Gratuitous plotting

As demonstrated in https://git.xfel.eu/gitlab/detectors/pycalibration/merge_requests/446 (I closed the old MR as master has changed a lot since then and my changes here are much less work than rebasing), there is room for significant speedups by parallelizing characterize_module. One simple strategy is to keep the multiprocessing.Pool that starts characterize_module and, within each call, use threading with something like pasha.
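
The nesting described above could look roughly like this. It is a self-contained sketch, not the code in this MR: the data shape, the random stand-in input, and the offset/noise reductions are placeholders; the point is the outer process pool over modules with a pasha ThreadContext inside each call.

```python
import multiprocessing

import numpy as np
import pasha as psh


def characterize_module(module_index, n_threads=12):
    # Placeholder for loading this module's raw darks: (cells, slow_px, fast_px).
    raw = np.random.default_rng(module_index).normal(size=(250, 64, 64))

    # Fan the per-memory-cell reduction out over threads within this process.
    ctx = psh.ThreadContext(num_workers=n_threads)
    offset = ctx.alloc(shape=(raw.shape[0],), dtype=np.float64)
    noise = ctx.alloc(shape=(raw.shape[0],), dtype=np.float64)

    def kernel(worker_id, index, cell):
        # Each call reduces one memory cell; writes go to disjoint indices.
        offset[index] = cell.mean()
        noise[index] = cell.std()

    ctx.map(kernel, raw)
    return module_index, offset, noise


if __name__ == '__main__':
    # Outer pool over modules, matching the 6-processes-by-12-threads
    # configuration used in the experiment below.
    with multiprocessing.Pool(6) as pool:
        results = pool.map(characterize_module, range(16))
```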

Experiment

Setup: default parameters for the notebook (some CALLAB proposal), processing modules 0 and 1 (as I think each slurm job would get 16/8 modules). I looked at the resource usage over time while executing the heaviest cell, which does all the characterize_module work.

First, the version from master: [image]. As it's using multiprocessing.Pool without an explicit worker count, it probably spawned 6 processes, as there are six files. CPU usage is pretty low; this was on max-exfl1237, which can run 72 threads. Note that even if we had many files, we'd probably run out of memory just scaling up the number of processes to one per file.

For comparison, here's the simple version using pasha, as introduced in the first commit of this MR, using 6 processes with 12 threads each: [image]

Further improvement

Let's look at the entire notebook (after parallelizing the main characterize_module part) with some arbitrary checkpoints: [image: measurements-2021-04-09-09-39]. Breaking this down slightly more shows some candidates for further optimization: [image: measurements-2021-04-09-13-03-checkpoints]

  • Speed up computing the statistics tables
    • There's a fair bit of computation going on in this last cell
    • Commit 05327835 throws parallelism at this; probably way overkill, but speeds that cell up something like 6x
  • Speed up retrieval of old constants
    • As pointed out by @ahmedk, the notebook currently gets file paths from the DB, then copies the files to a new folder, then loads them
    • Simply loading the files from wherever they are should already help a lot (see the sketch after this list)
  • Simplify data flow
    • The notebook was heavily influenced by ipyparallel, so some parts could be simpler
    • Not sure if this would speed up much, but let's see (likely to at least save memory)
    • Update: what I wanted to do here is a pretty big diff and there's not much performance to be gained, so I would leave this out
  • Clean up
    • Will reorder some things and rename a bit
    • Will put back get_pdu_from_db
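
For the old-constant retrieval, here is a minimal sketch of the load-in-place idea, assuming the calibration DB client has already resolved the file path and the dataset name (both hypothetical here); the point is only that no copy into the output folder is needed before reading.

```python
import h5py
import numpy as np


def load_old_constant(path, dataset="data"):
    """Read an old constant directly from wherever the DB says it lives."""
    with h5py.File(path, "r") as f:
        return np.asarray(f[dataset])


# Usage (path_from_db is whatever the DB lookup returned):
# old_offset = load_old_constant(path_from_db)
```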

Somewhat recent breakdown: [image] [image]

Or how about a full comparison: [image: cool-overview]

Comparing full jobs

For a "real life" test, I ran xfel-calibrate on the aforementioned data from SPB (p900188 runs 338, 339, and 340) a few times yesterday. Comparing master, mr (pretty much what you see here), and experimental (where I tried simplifying the data flow in not so important ways apparently). Note that there was quite some variance: probably file system load during the day and node allocation differences. And something went wrong with the master batch started 19:18, so it didn't do much computation. If we sum the time elapsed for all 16 jobs in each batch, the difference is pretty stark: image

Looking a bit at the distribution of time over the different jobs in the batches: [image]

Reviewers

@calibration @jsztuk

I have already made a number of comments for discussion. The major changes 1-4 are ordered roughly according to how much we gain relative to how much complexity they add to the code.

