Parallelise gain/mask compression for writing corrected AGIPD files
Description
Writing corrected files is currently a bottleneck for AGIPD correction. With the parameters I'm testing (mostly the defaults from the notebook), typical timings are roughly 40 seconds reading raw data, 45 seconds processing (no common mode), and 150 seconds writing. Watching it while writing, you mostly see 4 cores at 100% (writing 4 files) and the rest at almost 0.
HDF5 can't write in parallel without using MPI. However, I guessed that a large chunk of the 'writing' time was actually spent compressing the gain & mask data - both compress very well, but DEFLATE is slow. We can instead compress the data ourselves and use a lower-level API to write the pre-compressed chunks to HDF5, which lets us parallelise the compression.
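For reference, the trick looks roughly like this - a minimal sketch, not the exact code in this branch; the dataset name, chunking and thread-pool details are illustrative:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

import h5py
import numpy as np


def write_compressed(h5file, name, data, level=1):
    """Compress chunks in parallel and write them via the low-level API."""
    # One frame per chunk, so each chunk can be compressed independently.
    chunk_shape = (1,) + data.shape[1:]
    dset = h5file.create_dataset(
        name, shape=data.shape, dtype=data.dtype,
        chunks=chunk_shape, compression='gzip', compression_opts=level,
    )

    def compress_frame(i):
        # zlib produces the same format HDF5's DEFLATE filter stores, and it
        # releases the GIL, so a thread pool gives real parallelism here.
        return i, zlib.compress(np.ascontiguousarray(data[i]), level)

    with ThreadPoolExecutor() as pool:
        for i, compressed in pool.map(compress_frame, range(data.shape[0])):
            # Offset of this chunk within the dataset: frame i, row 0, col 0.
            dset.id.write_direct_chunk((i,) + (0,) * (data.ndim - 1), compressed)


# Hypothetical usage with a gain-like array (shape made up):
with h5py.File('corrected.h5', 'w') as f:
    gain = np.zeros((64, 512, 128), dtype=np.uint8)
    write_compressed(f, 'image/gain', gain)
```

Because the chunks arrive pre-compressed, `write_direct_chunk` bypasses HDF5's filter pipeline on the way in, but readers still see an ordinary gzip-compressed dataset. That bypass is also why the fletcher32 filter can't be applied automatically (see below).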
It seems to work - in my tests, writing each batch of files went from about 150 seconds to about 60, reducing the load-correct-write time by about 40% (or ~30% of the total time for the Slurm jobs).
However, I have also disabled the fletcher32 checksum algorithm for the compressed (gain & mask) data. If this is important, we would have to find or implement a reasonably fast version that can be called from Python.
How Has This Been Tested?
Run from the command line on Maxwell:
xfel-calibrate AGIPD CORRECT --in-folder /gpfs/exfel/exp/SPB/202130/p900201/raw --run 203 --karabo-id SPB_DET_AGIPD1M-1 --out-folder /gpfs/exfel/data/scratch/kluyvert/agipd-calib-900201-203-pcomp2 --karabo-id-control SPB_IRU_AGIPD1M1 --karabo-da-control AGIPD1MCTRL00 --sequences 0-4
No obvious differences in the reports, and reading a sample of the gain & mask data worked and gave the same data as the corresponding files produced from master.
Relevant Documents (optional)
Here's what it looks like in htop. We never max out all the cores - this is about the highest it gets - but it's much better than using only 4 cores:
Types of changes
- Performance improvement
- Breaking change (removing checksums)
Checklist:
- My code follows the code style of this project.
Activity
- Resolved by Philipp Schmidt
To expand a bit more on the checksum: there are various implementations around, including on Wikipedia and in HDF5 itself, which we could probably port to Cython without much difficulty. I believe this is applied to the compressed data, which is about 100x smaller than the uncompressed data, so it probably doesn't need to be particularly optimised.
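For illustration, the core of the algorithm is tiny - a naive pure-Python sketch (the real version would go in Cython or NumPy, and HDF5's filter may differ in details such as odd-length padding and byte order):

```python
def fletcher32(data: bytes) -> int:
    """Fletcher-32 checksum over 16-bit words (naive reference version)."""
    if len(data) % 2:
        data = data + b'\x00'  # pad odd-length input to whole 16-bit words
    sum1 = sum2 = 0
    for i in range(0, len(data), 2):
        word = data[i] | (data[i + 1] << 8)  # assume little-endian words
        sum1 = (sum1 + word) % 65535
        sum2 = (sum2 + sum1) % 65535
    return (sum2 << 16) | sum1
```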
But before I get into that, I'd like to ask: is a checksum in HDF5 valuable? Or are the filesystems reliable enough that we can skip it? I could imagine that the checksum was enabled because it's basically free, and it doesn't seem to be present on raw data. On the other hand, if the filesystem tells us that an entire file is invalid, it might be nice to have per-chunk checksums telling us that 99% of the data in that file is OK.
From the discussion today, I think we concluded that the HDF5 checksums are not that important for corrected data, because:
- GPFS has its own checksumming, and mechanisms to repair errors
- Corrected data is not meant to be preserved for the long term
- It should be possible to recreate corrected data exactly from raw data (whether or not it is yet)
HDF5's checksumming might be more valuable for raw data, which is preserved for longer, if it's fast enough that it doesn't cause problems for the DAQ.
So I think this is ready for review.
From the GPFS documentation: End-to-end checksum
Most implementations of RAID codes implicitly assume that disks reliably detect and report faults, hard-read errors, and other integrity problems. However, studies have shown that disks do not report some read faults and occasionally fail to write data, while actually claiming to have written the data. These errors are often referred to as silent errors, phantom-writes, dropped-writes, and off-track writes. To cover for these shortcomings, IBM Spectrum Scale RAID implements an end-to-end checksum that can detect silent data corruption caused by either disks or other system components that transport or manipulate the data.
When an NSD client is writing data, a checksum of 8 bytes is calculated and appended to the data before it is transported over the network to the IBM Spectrum Scale RAID server. On reception, IBM Spectrum Scale RAID calculates and verifies the checksum. Then, IBM Spectrum Scale RAID stores the data, a checksum, and version number to disk and logs the version number in its metadata for future verification during read.
When IBM Spectrum Scale RAID reads disks to satisfy a client read operation, it compares the disk checksum against the disk data and the disk checksum version number against what is stored in its metadata. If the checksums and version numbers match, IBM Spectrum Scale RAID sends the data along with a checksum to the NSD client. If the checksum or version numbers are invalid, IBM Spectrum Scale RAID reconstructs the data using parity or replication and returns the reconstructed data and a newly generated checksum to the client. Thus, both silent disk read errors and lost or missing disk writes are detected and corrected.
Correct, the rebrand was done a few years ago, and people still use the old name GPFS to avoid the abbreviation of Spectrum Scale (that is my personal interpretation). Please note that we are using IBM Spectrum Scale RAID, which in addition to plain IBM Spectrum Scale has end-to-end checksums, data redundancy via a Reed-Solomon code or N-way replication, and a declustered array, which results in very short rebuild times if a disk breaks. The RAID version is available with dedicated hardware, and IBM Spectrum Scale RAID plus dedicated hardware is called IBM Elastic Storage Server (ESS).
- Resolved by Thomas Kluyver
For the case I was testing with, about a 30% improvement (the correction jobs went from ~34 minutes to ~22, doing 16 modules x 2 sequences each). The proportion shouldn't change much with the number of sequence files, because you get a similar saving on writing each sequence file.
This is probably more or less the best case, because the default settings in the notebook (as I tested) turn off lots of optional parts of the correction, like common mode and baseline correction. Obviously, the longer it spends on actual processing, the smaller the relative speedup.
Edited by Thomas Kluyver
changed milestone to %3.4.0
mentioned in commit 5d46f8e6