[EPIX100] Reading ePix100 data with EXtra-data, Correct and Dark notebooks. (!500) · Merge requests · calibration / pycalibration

Merged Karim Ahmed requested to merge feat/read_with_EXtra_data into master 3 years ago

ePix100 correction and dark notebooks with EXtra-data.

Description

The main change is to stop reading sequence files directly with h5py and to use EXtra-data instead. EXtra-data reads all available sequences, either to correct them and produce one sequence file or to create dark constants.

This also replaces the ChunkReader functions from pyDetLib with a simple numpy mean and std for calculating the dark constants, and adds pasha for correcting the image trains.
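As a minimal sketch of the "numpy mean and std" idea, the snippet below computes per-pixel offset and noise maps from a stack of dark images. The function name `dark_constants`, the array shape and the synthetic data are assumptions made for illustration; this is not the notebook's actual code.

```python
import numpy as np


def dark_constants(dark_images):
    """Return (offset, noise) maps from a stack of dark images.

    dark_images has shape (n_trains, rows, cols). Statistics are taken
    over the train axis, so every pixel gets its own offset and noise.
    """
    dark_images = np.asarray(dark_images, dtype=np.float32)
    offset = np.mean(dark_images, axis=0)
    noise = np.std(dark_images, axis=0)
    return offset, noise


# Small synthetic stand-in for dark data (assumed values, not real detector data).
rng = np.random.default_rng(seed=0)
fake_darks = rng.normal(loc=1500.0, scale=10.0, size=(200, 8, 6))
offset, noise = dark_constants(fake_darks)
```

Both maps have the shape of a single image, here `(8, 6)`.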

1. Epix100 Correction/dark (small detector)
2. Remove db-module parameter from calibration_configurations. https://git.xfel.eu/detectors/calibration_configurations/-/merge_requests/17
3. Profiling numbers for DataCollection.select(..., require_all=True)

How Has This Been Tested?

ePix100-dark:

in-folder = "/gpfs/exfel/exp/HED/202030/p900136/raw"
run = 182
karabo_id = HED_IA1_EPX100-2
karabo_da = EPIX02

There are 1000 trains in sequence 0, which is used for generating darks in the current pycalibration release.

Data quality:

The tests compared the constants produced by the old and new implementations, to validate that the constants were not affected by any mistake. (np.allclose was used to validate Noise and Offset.)
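A comparison of this kind could look like the sketch below. The helper name `constants_match` and the synthetic arrays are hypothetical, and the tolerances are simply the `np.allclose` defaults, since the tolerances actually used are not stated here.

```python
import numpy as np


def constants_match(old, new, rtol=1e-5, atol=1e-8):
    """Return True if both arrays have the same shape and close values."""
    old = np.asarray(old)
    new = np.asarray(new)
    return old.shape == new.shape and bool(
        np.allclose(old, new, rtol=rtol, atol=atol)
    )


# Synthetic stand-ins for constants from the old and new implementations.
rng = np.random.default_rng(seed=0)
offset_old = rng.normal(1500.0, 10.0, size=(8, 6))
offset_new = offset_old.copy()      # identical result
offset_bad = offset_old + 1.0       # deliberately shifted result

same = constants_match(offset_old, offset_new)
different = constants_match(offset_old, offset_bad)
```

Checking the shape first avoids a silent pass through broadcasting when one of the arrays has an unexpected shape.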

Performance:

Previously, data was read and corrected using pyDetLib functions (ChunkReader and fastccdReader). The cell took more than 45 seconds for 1000 trains with a chunk size of 100.

The updated implementation, using EXtra-data and chunking with .split_trains(), took about 24 seconds for 1000 trains with a chunk size of 100.
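To illustrate what chunking 1000 trains with a chunk size of 100 means, here is a sketch that fills a preallocated array chunk by chunk. It does not call EXtra-data: plain array slicing stands in for `.split_trains()`, and `iter_chunks`, the in-memory `source` array and its shape are all assumptions for illustration.

```python
import numpy as np


def iter_chunks(n_trains, chunk_size):
    """Yield slices covering n_trains in consecutive chunks of at most chunk_size."""
    for start in range(0, n_trains, chunk_size):
        yield slice(start, min(start + chunk_size, n_trains))


n_trains, chunk_size = 1000, 100

# In-memory stand-in for data that would normally be read from files.
rng = np.random.default_rng(seed=1)
source = rng.normal(loc=1500.0, scale=10.0, size=(n_trains, 4, 4)).astype(np.float32)

images = np.empty_like(source)
n_chunks = 0
for chunk in iter_chunks(n_trains, chunk_size):
    # Only one chunk of trains is handled per iteration.
    images[chunk] = source[chunk]
    n_chunks += 1
```

The last chunk is shorter whenever the number of trains is not a multiple of the chunk size.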

Documents:

Before: EPIX100DARKCalibration_master.pdf

After: EPIX100DARKCalibration_EXtra-data.pdf

ePix100-correct:

Data quality:

The tests compared the corrected data produced by the old and new implementations, to validate that the corrections were not affected by any mistake. (np.allclose was used to validate the corrected files for sequence 000000.)

Documents + SLURM time report performance:

A comparison between 4 implementations was done.

The raw data used was: /gpfs/exfel/exp/HED/202002/p002710/raw/r00435. This data was used for the test because it consists of 4 sequences, with a total of 3813 trains; each of the first 3 sequence files contains 1000 trains.

1. Master: EPIX100CORRECTCalibration.pdf
2. EXtra-data to produce one corrected file for all sequences: EPIX100CORRECT-NORMALCalibration.pdf, Correction_ePix100_NBC_serial.ipynb
3. EXtra-data + Pasha to produce one corrected file for all sequences: EPIX100CORRECT-PASHACalibration.pdf, Correction_ePix100_NBC_pasha.ipynb
4. (Current branch) EXtra-data + Pasha to produce multiple corrected files, one for each sequence: EPIX100CORRECTCalibration.pdf. For 1 sequence file there is no big difference in performance, but the EXtra-data version is about 5 seconds slower.

Below is a plot comparing the processing time of the implementations.

Moving to EXtra-data is useful because the notebooks no longer depend on the number of available sequences or on the file names.

However, as the plot shows, preparing the corrected file (copying and sanitizing data) and reading control data took more time. This has little overall effect and was expected.

Correcting all trains and saving them to one corrected file took about 2x longer than master for 4 sequences. Using Pasha sped up the processing a bit, but it was still about 1.5x slower than master.

The last implementation uses H5File to correct sequences per SLURM node. Keeping the same level of parallelization, while using Pasha instead of pyDetLib to correct trains, resulted in a small performance gain: about 0.8x the time of master for 4 sequences.
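The idea of correcting trains in parallel can be sketched as follows. This is not pasha code: a thread pool from the Python standard library stands in for it, and the correction itself (subtracting a dark offset) as well as the names `correct_train` and `correct_all` are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def correct_train(raw_image, offset):
    """Hypothetical per-train correction: subtract the dark offset."""
    return raw_image.astype(np.float32) - offset


def correct_all(raw, offset, workers=4):
    """Correct every train of raw (n_trains, rows, cols) into a new array."""
    corrected = np.empty(raw.shape, dtype=np.float32)

    def kernel(index):
        # Each call writes only its own train, so workers do not conflict.
        corrected[index] = correct_train(raw[index], offset)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so that any exception in a worker is raised here.
        list(pool.map(kernel, range(raw.shape[0])))
    return corrected


# Synthetic raw data: a constant offset plus a small integer signal.
rng = np.random.default_rng(seed=2)
offset = np.full((4, 4), 1500.0, dtype=np.float32)
raw = (offset + rng.integers(0, 50, size=(20, 4, 4))).astype(np.uint16)
corrected = correct_all(raw, offset)
```

Writing each result into a preallocated output array by train index keeps the output ordered regardless of which worker finishes first.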

Note: the plotting is the same in all cases, and its performance depends on the number of images available. For EXtra-data + Pasha with 1 corrected file, all 3813 images were available. For the rest, either only the trains of one sequence file or the last chunk of images were available.