[AGIPD][darks] speed up darks processing
Description
A bunch of things in the AGIPD darks notebook could be faster. The changes proposed, in order from probably more impactful to less important:
1. Computation in characterize_module can be parallelized
   - With something like pasha, this is a small code change with a very large impact (see the sketch after this list)
2. When retrieving old constants, we don't need to save them to the output folder (do we?); we already have them in memory for comparison
3. When retrieving old constants for comparison, we can do some of the other plotting while waiting for the DB
4. Computing the ADU statistics tables at the end can also be parallelized
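To make change 1 concrete, here is a minimal sketch of the pasha approach (assuming pasha's ThreadContext/alloc/map API; the function name, array shapes, and the statistics computed are illustrative, not the notebook's actual code). The per-memory-cell work is independent, so it can be mapped over threads that write into preallocated output arrays:

```python
import numpy as np
import pasha as psh


def characterize_cells(im, cell_ids, n_cells=352, num_workers=12):
    """Illustrative stand-in for the per-cell part of characterize_module."""
    ctx = psh.ThreadContext(num_workers=num_workers)

    # Output arrays shared between the worker threads.
    offset = ctx.alloc(shape=(n_cells,) + im.shape[1:], dtype=np.float64)
    noise = ctx.alloc(shape=(n_cells,) + im.shape[1:], dtype=np.float64)

    def process_cell(worker_id, array_index, cell_id):
        # Select all frames belonging to this memory cell and reduce them.
        cell_data = im[cell_ids == cell_id]
        offset[cell_id] = np.median(cell_data, axis=0)
        noise[cell_id] = np.std(cell_data, axis=0)

    # pasha maps the kernel over the unique cell IDs using the thread pool.
    ctx.map(process_cell, np.unique(cell_ids))
    return offset, noise
```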
How has this been tested?
See the long description with gratuitous plots further down.
To check that the output is not borked, I compared it with output from the master branch when running on some recent data:
```bash
xfel-calibrate AGIPD DARK \
    --slurm-name "${PROJECT}" \
    --in-folder "/gpfs/exfel/exp/SPB/202130/p900188/raw" \
    --out-folder "${TESTROOT}/${PROJECT}/batch/${TESTNAME}-${RUNNAME}-${TIMESTAMP}-data" \
    --report-to "${TESTROOT}/${PROJECT}/batch/${TESTNAME}-${RUNNAME}-${TIMESTAMP}-report" \
    --run-high 338 \
    --run-med 339 \
    --run-low 340 \
    --karabo-id "SPB_DET_AGIPD1M-1" \
    --karabo-id-control "SPB_IRU_AGIPD1M1" \
    --karabo-da-control "AGIPD1MCTRL00" \
    --h5path-ctrl "/CONTROL/{}/MDL/FPGA_COMP" \
    --local-output \
    --no-db-output
```
Judging by h5diff, nothing changed on the output side of things.
Types of changes
- Bug fix (had to refactor a bit anyway, including the code causing https://git.xfel.eu/gitlab/detectors/pycalibration/issues/49)
- New feature? I guess maybe.
- Refactor definitely
Gratuitous plotting
As demonstrated in https://git.xfel.eu/gitlab/detectors/pycalibration/merge_requests/446 (I closed the old MR as master has changed a lot since then and my changes here are much less work than rebasing), there is room for significant speedups by parallelizing characterize_module.
One simple strategy is to keep the multiprocessing.Pool that starts characterize_module and, within each call, use threading with something like pasha.
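Schematically, that split could look like the sketch below (not the MR's actual code: load_module_data is a hypothetical helper, characterize_cells is the thread-parallel sketch from the description above, and the worker counts just mirror the 6 processes with 12 threads each used in the experiment below).

```python
from multiprocessing import Pool

N_PROCESSES = 6           # roughly one process per input file in this test
THREADS_PER_PROCESS = 12  # pasha workers inside each call


def characterize_module(inp):
    module, gain = inp
    # Load the raw frames for this module/gain (load_module_data is a
    # hypothetical stand-in for the notebook's file reading) and run the
    # thread-parallel per-cell characterization sketched earlier.
    im, cell_ids = load_module_data(module, gain)
    offset, noise = characterize_cells(
        im, cell_ids, num_workers=THREADS_PER_PROCESS)
    return module, gain, offset, noise


# One task per (module, gain) combination, e.g. modules 0 and 1, three gains.
inputs = [(module, gain) for module in (0, 1) for gain in range(3)]

with Pool(processes=N_PROCESSES) as pool:
    results = pool.map(characterize_module, inputs)
```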
Experiment
Setup: default parameters for the notebook (some CALLAB proposal), processing modules 0 and 1 (as I think each slurm job would get 16/8 = 2 modules).
Let's look at the resource usage over time when executing the heaviest cell, the one that does all the characterize_module work.
First, the version from master:
As it's using multiprocessing.Pool without an explicit worker count, the pool defaults to one worker per CPU core, but with only six files there were at most six tasks running at once.
CPU usage is pretty low; this was on max-exfl1237 which can do 72 threads.
Note that even if we had many files, we'd probably run out of memory if we just scaled the number of processes up to one per file.
For comparison, here's the simple version using pasha as presented in the first commit of this MR, using 6 processes with 12 threads each:
Further improvement
Let's look at the entire notebook (after parallelizing the main characterize_module part) with some arbitrary checkpoints:
Breaking this down slightly more shows some candidates for further optimization:
- Speed up computing the statistics tables
  - There's a fair bit of computation going on in this last cell
  - Commit 05327835 throws parallelism at this; probably way overkill, but it speeds that cell up something like 6x (a rough sketch of the idea follows this list)
- Speed up retrieval of old constants
  - As pointed out by @ahmedk, the notebook currently gets file paths from the DB, then copies the files to a new folder, then loads them
  - Simply loading the files from wherever they are should already help a lot
- Simplify data flow
  - The notebook was heavily influenced by ipyparallel, so some parts could be simpler
  - Not sure if this would speed up much, but let's see (likely to at least save memory)
  - Update: what I wanted to do here is a pretty big diff and there's not much performance to be gained, so I would leave this out
- Clean up
  - Will reorder some things and rename a bit
  - Will put back get_pdu_from_db
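For the statistics tables, the point is simply that each (constant, module, gain) entry is independent, so the table rows can be computed concurrently. A rough sketch of that kind of parallelization follows; the dict layout, the helper names, and the use of multiprocessing.Pool are assumptions for illustration, not necessarily what commit 05327835 does.

```python
from multiprocessing import Pool

import numpy as np


def _stats_row(task):
    # One table row: summary numbers for a single constant/module/gain.
    name, module, gain, data = task
    return name, module, gain, np.nanmedian(data), np.nanmean(data), np.nanstd(data)


def tabulate_constants(constants, n_workers=16):
    # Assumed layout: {constant_name: {module: array with gain as last axis}}.
    tasks = [
        (name, module, gain, per_module[module][..., gain])
        for name, per_module in constants.items()
        for module in per_module
        for gain in range(3)
    ]
    with Pool(n_workers) as pool:
        return pool.map(_stats_row, tasks)
```

In practice a thread pool (or another pasha context) might be preferable, since numpy releases the GIL during these reductions and that avoids copying the arrays to worker processes; the point here is only that the rows are independent.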
Comparing full jobs
For a "real life" test, I ran xfel-calibrate
on the aforementioned data from SPB (p900188
runs 338
, 339
, and 340
) a few times yesterday.
I compared master, mr (pretty much what you see here), and experimental (where I tried simplifying the data flow, in not so important ways apparently).
Note that there was quite some variance: probably file system load during the day and node allocation differences.
And something went wrong with the master batch started at 19:18, so it didn't do much computation.
If we sum the time elapsed for all 16 jobs in each batch, the difference is pretty stark:
Looking a bit at the distribution of time over the different jobs in the batches:
Reviewers
I have already made a number of comments for discussion. The major changes 1-4 are ordered roughly according to how much we gain relative to how much complexity they add to the code.