[Tests] clearer comparison of HDF5 files
Working on !931 (merged), I found the comparison of HDF5 files against reference files was not very clear. This aims to show all the differences found, not only the first one (i.e. not asserting on each separate step), to present the results in a clearer format, and to simplify the code doing the comparisons.
Based off the branch in !931 (merged), because I know that fails the relevant tests, so I can see what the results look like.
cc @ahmedk
Output looks like this (updated after commit bcbfb11d):
--- Compare HDF5 files ----
REF: /gpfs/exfel/data/scratch/xcaltst/test/reference_folder/FXE_XAD_JF1M/FXE_XAD_JF1M-DARK-BURST
NEW: /gpfs/exfel/data/scratch/xcaltst/test/test/compare-h5-files/FXE_XAD_JF1M/FXE_XAD_JF1M-DARK-BURST
const_BadPixelsDark_Jungfrau_M125.h5
  ~ CHANGED: data
const_BadPixelsDark_Jungfrau_M233.h5
  ~ CHANGED: data
const_Noise_Jungfrau_M125.h5
  - ✓ no changes
const_Noise_Jungfrau_M233.h5
  - ✓ no changes
const_Offset_Jungfrau_M125.h5
  - ✓ no changes
const_Offset_Jungfrau_M233.h5
  - ✓ no changes
Edited by Thomas Kluyver

It's certainly possible, but for testing I actually prefer printed output. By default, pytest captures printed output per test and shows it only if the test fails, so there's less output to look through when just one or two tests are failing.
Relying on that, you can find these blocks by searching for "Compare HDF5" or "REF:". But you use this testing machinery more often than I do, so if you'd still prefer it to log the output, let me know.
Edited by Thomas Kluyver
added 3 commits
added 2 commits
> This aims to show all the differences found, not only the first one (i.e. not asserting on each separate step), to present the results in a clearer format, and to simplify the code doing the comparisons.
Thank you for this MR. I will have a look as soon as possible. I think I did it that way to save a bit of time when checking the files, but your approach has no performance drawbacks. Excited to review :)
added 1 commit
- 1d55b146 - Move assert into test function, smaller stack trace on failure
I found this hack through experience, but I can't explain why it works here and not in other cases. You go to Pipelines and hit retry on the last pipeline. All the tests are restarted, and you can configure the manual test again with the new configuration, as if you had a fresh pipeline after a new commit.
Edited by Karim Ahmed
added 1 commit
- 303cb726 - Only use equal_nan=True to compare float datasets
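For context on that commit: `np.array_equal(..., equal_nan=True)` relies on `np.isnan`, which fails with a TypeError on non-numeric dtypes such as the string datasets here (file_loc, creation_time), so equal_nan can only be used for floating-point data. A sketch of the idea (illustrative, not the MR's exact helper):

```python
import numpy as np

def datasets_equal(ref, new):
    """Compare two datasets, treating NaN == NaN only for float data.
    (Illustrative sketch; the MR's actual helper may differ.)"""
    ref, new = np.asarray(ref), np.asarray(new)
    if ref.shape != new.shape or ref.dtype != new.dtype:
        return False
    if np.issubdtype(ref.dtype, np.floating):
        # equal_nan=True would raise TypeError for e.g. string dtypes,
        # so only pass it for floating-point datasets.
        return np.array_equal(ref, new, equal_nan=True)
    return np.array_equal(ref, new)
```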
We're picking up some changes in constants created from AGIPD darks, e.g. looking at one such file:

- file_loc:
  proposal:900203 runs:9013 9012 9011
  -> proposal:900203 runs:9011 9012 9013
  (different ordering)
- creation_time:
  2021-07-30T19:23:18.872347+00:00
  -> 2021-07-30T19:25:51.826227+00:00

We added sorting of the run numbers recently, but file_loc is constructed with the low-med-high run numbers before they're sorted, and creation_time comes from run_high before sorting. I'm guessing we changed the order of runs in the tests to ensure the sorting worked, and the reference data was generated before that.

It also seems to be prone to running out of RAM while loading the arrays to compare, so I probably need to be smarter about that.
> - file_loc:
>   proposal:900203 runs:9013 9012 9011
>   -> proposal:900203 runs:9011 9012 9013
>   (different ordering)
> - creation_time:
>   2021-07-30T19:23:18.872347+00:00
>   -> 2021-07-30T19:25:51.826227+00:00
I am wondering now why this didn't happen before your changes. Have you seen any bugs in this part?
I'd prefer to fix this in the AGIPD dark notebook: we should generate the same constants regardless of run order. I will fix it separately.
added 1 commit
- b0f1c5f0 - Avoid loading large arrays into memory in one go
mentioned in merge request !938 (closed)
I've fixed it so the comparison doesn't require lots of RAM (not loading entire datasets), and made it show some more information. In particular, where datasets with a single value have changed, we show the old & new values:
--- Compare HDF5 files ----
REF: /gpfs/exfel/data/scratch/xcaltst/test/reference_folder/SPB_DET_AGIPD1M-1/SPB_DET_AGIPD1M-1-DARK-FIXED
NEW: /gpfs/exfel/data/scratch/xcaltst/test/test/compare-h5-files/SPB_DET_AGIPD1M-1/SPB_DET_AGIPD1M-1-DARK-FIXED
const_BadPixelsDark_AGIPD_SIV1_AGIPDV11_M215.h5
  ~ CHANGED: creation_time (Value: b'2021-07-30T19:23:18.872347+00:00' -> b'2021-07-30T19:25:51.826227+00:00')
  ~ CHANGED: file_loc (Value: b'proposal:900203 runs:9013 9012 9011' -> b'proposal:900203 runs:9011 9012 9013')
const_BadPixelsDark_AGIPD_SIV1_AGIPDV11_M234.h5
  ~ CHANGED: creation_time (Value: b'2021-07-30T19:23:18.872347+00:00' -> b'2021-07-30T19:25:51.826227+00:00')
  ~ CHANGED: file_loc (Value: b'proposal:900203 runs:9013 9012 9011' -> b'proposal:900203 runs:9011 9012 9013')
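The low-memory comparison can be sketched as reading slices along the first axis instead of whole arrays; the same slicing works on h5py Datasets, which read lazily when indexed. This is an illustrative version, not the MR's exact code:

```python
import numpy as np

def arrays_differ(ref_dset, new_dset, step=64):
    """Compare two array-like datasets slice by slice along the first
    axis, so only `step` rows are in memory at a time."""
    if ref_dset.shape != new_dset.shape:
        return True
    if ref_dset.ndim == 0:
        # Scalar datasets: nothing to slice, compare directly.
        return not np.array_equal(ref_dset[()], new_dset[()])
    for start in range(0, ref_dset.shape[0], step):
        a = np.asarray(ref_dset[start:start + step])
        b = np.asarray(new_dset[start:start + step])
        if not np.array_equal(a, b):
            return True
    return False
```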
- Resolved by Karim Ahmed
I'm not sure if you mentioned this anywhere, but this MR improves the performance of the pipeline considerably. Looking quickly at some other MRs where all of the reference runs are tested, I see around 2.5 hours of improvement.
Thanks again, Thomas! Besides the few comments, LGTM!
Edited by Karim Ahmed

Excellent! I thought there might be a performance benefit, but I hadn't got round to checking.
The previous code was using a thread pool, but unfortunately this doesn't help with HDF5. HDF5 is thread safe, but it works by putting a big lock round the whole library, so only one thread can be making HDF5 calls at a time, even if they're working on different files. So you normally need multiple processes to speed up accessing HDF5 files.
(HDF group's answer is usually to use 'parallel HDF5', i.e. MPI. That works, even from Python, but it's more complex.)
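To illustrate the multiple-processes point: with the standard library's ProcessPoolExecutor, each worker process has its own copy of the HDF5 library (and its lock), so comparisons of different files can genuinely run in parallel. Everything below — the function names and the stubbed-out comparison — is hypothetical, not code from this MR:

```python
from concurrent.futures import ProcessPoolExecutor

def compare_one_file(name):
    # In real code this would open the reference and new HDF5 files
    # (e.g. with h5py) and compare them; stubbed out here.
    files_match = True
    return name, files_match

def compare_all(names):
    # Each worker is a separate process, so HDF5's global lock no
    # longer serialises the per-file comparisons.
    with ProcessPoolExecutor() as pool:
        return dict(pool.map(compare_one_file, names))
```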