Parallelise gain/mask compression for writing corrected AGIPD files
Description
Writing corrected files is currently a bottleneck for AGIPD correction. With the parameters I'm testing (mostly the defaults from the notebook), typical timings are roughly 40 seconds reading raw data, 45 seconds processing (no common mode), and 150 seconds writing. Watching it while writing, you mostly see 4 cores at 100% (writing 4 files) and the rest at almost 0.
HDF5 can't write in parallel without using MPI. However, I guessed that a large chunk of the 'writing' time was actually spent compressing the gain & mask data - both compress very well, but DEFLATE is slow. We can instead compress the data ourselves and use a lower-level API to write the pre-compressed chunks to HDF5, which lets us parallelise the compression.
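For reference, the trick looks roughly like this - a minimal sketch, not the exact code in this branch; the dataset name, chunking and thread-pool details are illustrative:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

import h5py
import numpy as np


def write_compressed(h5file, name, data, level=1):
    """Compress chunks in parallel and write them via the low-level API."""
    # One frame per chunk, so each chunk can be compressed independently.
    chunk_shape = (1,) + data.shape[1:]
    dset = h5file.create_dataset(
        name, shape=data.shape, dtype=data.dtype,
        chunks=chunk_shape, compression='gzip', compression_opts=level,
    )

    def compress_frame(i):
        # zlib produces the same format HDF5's DEFLATE filter stores, and it
        # releases the GIL, so a thread pool gives real parallelism here.
        return i, zlib.compress(np.ascontiguousarray(data[i]), level)

    with ThreadPoolExecutor() as pool:
        for i, compressed in pool.map(compress_frame, range(data.shape[0])):
            # Offset of this chunk within the dataset: frame i, row 0, col 0.
            dset.id.write_direct_chunk((i,) + (0,) * (data.ndim - 1), compressed)


# Hypothetical usage with a gain-like array (shape made up):
with h5py.File('corrected.h5', 'w') as f:
    gain = np.zeros((64, 512, 128), dtype=np.uint8)
    write_compressed(f, 'image/gain', gain)
```

Because the chunks arrive pre-compressed, `write_direct_chunk` bypasses HDF5's filter pipeline on the way in, but readers still see an ordinary gzip-compressed dataset. That bypass is also why the fletcher32 filter can't be applied automatically (see below).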
It seems to work - in my tests, writing each batch of files went from about 150 seconds to about 60, reducing the load-correct-write time by about 40% (or ~30% of the total time for the Slurm jobs).
However, I have also disabled the fletcher32 checksum algorithm for the compressed (gain & mask) data. If this is important, we would have to find or implement a reasonably fast version that can be called from Python.
How Has This Been Tested?
Run from the command line on Maxwell:
xfel-calibrate AGIPD CORRECT --in-folder /gpfs/exfel/exp/SPB/202130/p900201/raw --run 203 --karabo-id SPB_DET_AGIPD1M-1 --out-folder /gpfs/exfel/data/scratch/kluyvert/agipd-calib-900201-203-pcomp2 --karabo-id-control SPB_IRU_AGIPD1M1 --karabo-da-control AGIPD1MCTRL00 --sequences 0-4
No obvious differences in the reports, and reading a sample of the gain & mask data worked and gave the same data as the corresponding files produced from master.
Relevant Documents (optional)
Here's what it looks like in htop. We never max out all the cores - this is about the highest it gets - but it's much better than using only 4 cores:
Types of changes
- Performance improvement
- Breaking change (removing checksums)
Checklist:
- My code follows the code style of this project.
Activity
- Resolved by Philipp Schmidt
To expand a bit more on the checksum: there are various implementations around, including on Wikipedia and in HDF5 itself, which we could probably port to Cython without much difficulty. I believe this is applied to the compressed data, which is about 100x smaller than the uncompressed data, so it probably doesn't need to be particularly optimised.
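For illustration, the core of the algorithm is tiny - a naive pure-Python sketch (the real version would go in Cython or NumPy, and HDF5's filter may differ in details such as odd-length padding and byte order):

```python
def fletcher32(data: bytes) -> int:
    """Fletcher-32 checksum over 16-bit words (naive reference version)."""
    if len(data) % 2:
        data = data + b'\x00'  # pad odd-length input to whole 16-bit words
    sum1 = sum2 = 0
    for i in range(0, len(data), 2):
        word = data[i] | (data[i + 1] << 8)  # assume little-endian words
        sum1 = (sum1 + word) % 65535
        sum2 = (sum2 + sum1) % 65535
    return (sum2 << 16) | sum1
```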
But before I get into that, I'd like to ask: is a checksum in HDF5 valuable? Or are the filesystems reliable enough that we can skip it? I could imagine that the checksum was enabled because it's basically free, and it doesn't seem to be present on raw data. On the other hand, if the filesystem tells us that an entire file is invalid, it might be nice to have per-chunk checksums telling us that 99% of the data in that file is OK.
From the discussion today, I think we concluded that the HDF5 checksums are not that important for corrected data, because:
- GPFS has its own checksumming, and mechanisms to repair errors
- Corrected data is not meant to be preserved for the long term
- It should be possible to recreate corrected data exactly from raw data (whether or not it is yet)
HDF5's checksumming might be more valuable for raw data, which is preserved for longer, if it's fast enough that it doesn't cause problems for the DAQ.
So I think this is ready for review.
From the GPFS documentation: End-to-end checksum
Most implementations of RAID codes implicitly assume that disks reliably detect and report faults, hard-read errors, and other integrity problems. However, studies have shown that disks do not report some read faults and occasionally fail to write data, while actually claiming to have written the data. These errors are often referred to as silent errors, phantom-writes, dropped-writes, and off-track writes. To cover for these shortcomings, IBM Spectrum Scale RAID implements an end-to-end checksum that can detect silent data corruption caused by either disks or other system components that transport or manipulate the data.
When an NSD client is writing data, a checksum of 8 bytes is calculated and appended to the data before it is transported over the network to the IBM Spectrum Scale RAID server. On reception, IBM Spectrum Scale RAID calculates and verifies the checksum. Then, IBM Spectrum Scale RAID stores the data, a checksum, and version number to disk and logs the version number in its metadata for future verification during read.
When IBM Spectrum Scale RAID reads disks to satisfy a client read operation, it compares the disk checksum against the disk data and the disk checksum version number against what is stored in its metadata. If the checksums and version numbers match, IBM Spectrum Scale RAID sends the data along with a checksum to the NSD client. If the checksum or version numbers are invalid, IBM Spectrum Scale RAID reconstructs the data using parity or replication and returns the reconstructed data and a newly generated checksum to the client. Thus, both silent disk read errors and lost or missing disk writes are detected and corrected.
Correct, the rebrand was done a few years ago, and people still use the old name GPFS to avoid the abbreviation of Spectrum Scale (that is my personal interpretation). Please note that we are using IBM Spectrum Scale RAID, which in addition to plain IBM Spectrum Scale has end-to-end checksums, data redundancy via a Reed-Solomon code or N-way replication, and a declustered array, which results in very short rebuild times if a disk breaks. The RAID version is available with dedicated hardware, and IBM Spectrum Scale RAID plus dedicated hardware is called IBM Elastic Storage Server (ESS).
- Resolved by Thomas Kluyver
For the case I was testing with, about a 30% improvement (the correction jobs went from ~34 minutes to ~22, doing 16 modules x 2 sequences each). The proportion shouldn't change much with the number of sequence files, because you get a similar saving on writing each sequence file.
This is probably more or less the best case, because the default settings in the notebook (as I tested) turn off lots of optional parts of the correction, like common mode and baseline correction. Obviously, the longer it spends on actual processing, the smaller the relative speedup.
Edited by Thomas Kluyver
changed milestone to %3.4.0
mentioned in commit 5d46f8e6