[AGIPD][CORRECT] Fixes for correcting AGIPD HIBEF data
SUMMARY:
Changes made by Jola and me to run the correction notebook for AGIPD HIBEF:
- New plot changes.
- Bugfix: the default fill value of np.nan from EXtra-data was not converted to 0.
- Changed the method of reading data from the HDF5 file to fix very slow data reading.
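A minimal sketch of the idea behind the second bullet, assuming image data arrives as a NumPy array in which missing entries were filled with EXtra-data's default fill value of np.nan (illustrative only, not the exact pycalibration code):

    import numpy as np

    # Entries filled with np.nan (missing data) are replaced with 0
    # before the correction is applied.
    data = np.array([1.0, np.nan, 3.0], dtype=np.float32)
    data[np.isnan(data)] = 0.0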
Tests:
- Raw data calibrated for testing by directly running the notebook:
- /gpfs/exfel/exp/HED/202031/p900174/raw/r0155
- /gpfs/exfel/exp/MID/201901/p002542/raw/r0229
- /gpfs/exfel/exp/SPB/202030/p900119/raw/r0098
Testing using the xfel-calibrate CLI:
xfel-calibrate AGIPD CORRECT \
--slurm-mem 750 \
--slurm-name cor_HED_900174_r155 \
--karabo-da -1 \
--receiver-id {}CH0 \
--karabo-id-control HED_EXP_AGIPD500K2G \
--karabo-da-control AGIPD500K2G00 \
--h5path-ctrl /CONTROL/{}/MDL/FPGA_COMP \
--overwrite \
--sequences-per-node 1 \
--in-folder /gpfs/exfel/exp/HED/202031/p900174/raw/ \
--out-folder /gpfs/exfel/data/scratch/ahmedk/test/AGIPD_HIBEF \
--karabo-id HED_DET_AGIPD500K2G \
--run 155 \
--force-hg-if-below \
--force-mg-if-below \
--low-medium-gap \
--zero-nans \
--modules 0-7 \
--acq-rate 4.5 \
--bias-voltage 200 \
--gain-setting 0 \
--only-offset \
--sequences 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
Details related to the 3rd point:
There was an issue where reading image data for the HIBEF AGIPD was taking too long (> 2000 seconds).
It turned out that there are restrictions when reading data using fancy indexing (parameter firange), as we do in the agipdlib.read_file function.
I tested with all three of the raw datasets mentioned above, each of which used a list of more than 1000 indices while loading data, but the reading-performance issue only appeared for HIBEF. Reading the whole image dataset and indexing it afterwards was faster by a large factor (see the sketch after the h5py note below).
What differs between the HIBEF and the SPB runs is not the length of the index list (SPB's is larger) but the fact that the list of indices used while reading data is not consecutively increasing (non-incrementing), because of the many empty pulses present in the HIBEF raw data.
Also, the mentioned HIBEF run was tested with sequence 16, which had zero counts of 7 and more; that is why 2 was used.
The h5py documentation on fancy indexing (https://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing) mentions the following restrictions:
- Selection coordinates must be given in increasing order
- Duplicate selections are ignored
- Very long lists (> 1000 elements) may produce poor performance
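A minimal sketch of the workaround, assuming group is the open h5py 'image' group and firange is the sorted index array built in read_file; the helper name, the branch structure and the threshold are illustrative, not the exact pycalibration implementation:

    import numpy as np
    import h5py

    def read_selected_images(group: h5py.Group, firange: np.ndarray) -> np.ndarray:
        # Contiguous selection: a plain slice avoids fancy indexing entirely.
        if firange.size and np.all(np.diff(firange) == 1):
            return group['data'][firange[0]:firange[-1] + 1]
        # Long, non-contiguous selection (many filtered empty pulses):
        # h5py fancy indexing degrades badly for > 1000 elements, so read
        # the whole dataset once and index it in memory instead.
        if firange.size > 1000:
            return np.asarray(group['data'])[firange]
        # Short selection: fancy indexing directly on the dataset is fine.
        return group['data'][list(firange)]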
Activity
added 9 commits
- efff9368...dba31abb - 8 commits from branch master
- ea7eaa49 - resolve conflict AGIPD-minhalf
added 1 commit
- 51cf0a2f - Apply suggestion to cal_tools/cal_tools/agipdlib.py
Diff context (notebook cell; new lines extract the instrument string and map it to a detector instance):

    print(f"Gain setting: {gain_setting}")
    print(f"Detector in use is {karabo_id}")

    # Extracting Instrument string
    instrument = karabo_id.split("_")[0]
    # Evaluate detector instance for mapping
    if instrument == "SPB":
        dinstance = "AGIPD1M1"
        nmods = 16
    elif instrument == "MID":
        dinstance = "AGIPD1M2"
        nmods = 16
    # TODO: Remove DETLAB
    elif instrument == "HED" or instrument == "DETLAB":
added 1 commit
- 4e880079 - Apply suggestion to notebooks/AGIPD/AGIPD_Retrieve_Constants_Precorrection.ipynb
Diff context in cal_tools/cal_tools/agipdlib.py (the new check before reading image data):

        group['trainId'][firange])
    data_dict['moduleIdx'][0] = module_idx
    data_dict['nImg'][0] = n_img
    raw_data = group['data'][firange]
    # Avoid very slow performance using fancy indexing,
    # if firange was bigger than 1000 and consists of
    # non-incrementing indices
    if (n_img > 1000 and
            all(y-x == 1 for x, y in zip(firange, firange[1:]))):

Changed the check for non-incrementing indices in firange, to avoid slow reading performance.

This check takes around 0.2 seconds in the case of a True result, and in that case fancy indexing can be faster by about 10 seconds for n_img > 8000.

You can probably make the check much faster by using numpy vectorised operations rather than iterating over pairs of numbers in Python - something like:
np.all(np.diff(firange) == 1)
Going one step further, we could probably skip the check altogether if gen_valid_range() returned either an array or a slice object, i.e. replacing np.arange() here with slice():

> You can probably make the check much faster by using numpy vectorised operations rather than iterating over pairs of numbers in Python
Yes, it is indeed way faster, thank you! (about 1ms)
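For reference, a small self-contained sketch of the two contiguity checks being compared; firange stands in for the real index array and the helper names are made up for illustration:

    import numpy as np

    def is_contiguous_python(firange):
        # Pairwise Python loop, as in the original diff above.
        return all(y - x == 1 for x, y in zip(firange, firange[1:]))

    def is_contiguous_numpy(firange):
        # Vectorised check suggested in the review (about 1 ms vs ~0.2 s).
        return bool(np.all(np.diff(firange) == 1))

    firange = np.arange(50_000)  # a contiguous example selection
    assert is_contiguous_python(firange) and is_contiguous_numpy(firange)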
> Going one step further, we could probably skip the check altogether if gen_valid_range() returned either an array or a slice object, i.e. replacing np.arange() here with slice():

Yes, true, I thought about doing the same. The issue here is with https://git.xfel.eu/gitlab/detectors/pycalibration/blob/b64962482e81fd89b32f32259f5324849d09e09f/cal_tools/cal_tools/agipdlib.py#L608-656
Entries with count == 0 are considered invalid, and this function returns a list of valid_indices after filtering out the empty pulses, which occur a lot for AGIPD-HIBEF, as we saw last time.
How can we return a slice object for non-incrementing indices?
Ah, I'd missed that it would never take the valid_indices is None branch (for data with the new index format). Yes, in that case, checking whether we've got an array of contiguous indices makes sense.
Looking at the code around filtering trains, though, I suspect we're doing something wrong with the index in the output files when a train gets filtered out. That's not related to this PR, though.
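To make the constraint concrete, here is a rough sketch (not the actual gen_valid_range() code) of how a valid_indices-style array can be built from a first/count index while dropping empty entries; because the kept ranges need not be adjacent, the result is generally not expressible as a single slice:

    import numpy as np

    def valid_image_indices(first: np.ndarray, count: np.ndarray) -> np.ndarray:
        """Illustrative only: gather image indices for index entries with
        count > 0, skipping empty entries (the HIBEF case discussed above)."""
        keep = count > 0
        ranges = [np.arange(f, f + c) for f, c in zip(first[keep], count[keep])]
        if not ranges:
            return np.empty(0, dtype=np.int64)
        return np.concatenate(ranges)

    # Example: the second entry is empty, so the gathered indices are
    # 0..9 and 20..24 - a gap that no single slice object can describe.
    first = np.array([0, 10, 20])
    count = np.array([10, 0, 5])
    print(valid_image_indices(first, count))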
changed this line in version 14 of the diff
OK, I think it is actually doing the right thing; I was just finding it difficult to follow this code:
We might want to look at using EXtra-data for reading and writing files at some point, though it probably needs some changes to handle writing. Here's the corresponding code for creating a first, count index in EXtra-data:
They're not quite the same - I think the code in pycalibration can handle train IDs that aren't in increasing order, whereas EXtra-data will refuse that. On the other hand, I think the pycalibration code will produce bad output if a train ID repeats like [1, 2, 1] - though that should be impossible. I'm aware that any change is a risky business so long as what we have works!

Yes, there were old discussions on what the action should be in the case of duplicated trains, and using EXtra-data was even considered; then we had the bigger discussion of using EXtra-data to validate raw data before correction, or even validating after correction and sending emails in case validation fails.
I guess this MR is related to it: https://git.xfel.eu/gitlab/detectors/pycalibration/merge_requests/296
Anyhow, I have updated the MR with the NumPy vectorised operations, but I will not resolve this discussion, as I would like to leave your last comment unhidden; I find it very useful.
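For context on the first, count index mentioned above, here is a rough sketch of one way such an index can be derived from the per-image trainId column with NumPy; this is only in the spirit of what EXtra-data does, not its actual implementation, and it assumes the train IDs are already sorted (the case EXtra-data insists on):

    import numpy as np

    def make_first_count(train_ids: np.ndarray):
        """Build a (train_id, first, count) index from a sorted per-image
        trainId array. Illustrative sketch, not the EXtra-data code."""
        if np.any(np.diff(train_ids) < 0):
            raise ValueError("train IDs must be in increasing order")
        unique_ids, first, count = np.unique(
            train_ids, return_index=True, return_counts=True
        )
        return unique_ids, first, count

    # Example: three trains with 2, 3 and 1 images respectively.
    tids = np.array([1000, 1000, 1001, 1001, 1001, 1002])
    print(make_first_count(tids))
    # (array([1000, 1001, 1002]), array([0, 2, 5]), array([2, 3, 1]))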
added 1 commit
- 45d7f9e4 - use numpy vectorised operations rather than iterating over pairs of numbers