[Tests] clearer comparison of HDF5 files

Merged Thomas Kluyver requested to merge test/compare-h5-files into master

Working on !931 (merged), I found the comparison of HDF5 files against reference files was not very clear. This aims to show all the differences found, not only the first one (i.e. not asserting on each separate step), to present the results in a clearer format, and to simplify the code doing the comparisons.

Based on the branch in !931 (merged), because I know it fails the relevant tests, so I can see what the results look like.

cc @ahmedk

Edited by Thomas Kluyver

Activity

  • Output looks like this (updated after commit bcbfb11d):

      --- Compare HDF5 files  ----
      REF: /gpfs/exfel/data/scratch/xcaltst/test/reference_folder/FXE_XAD_JF1M/FXE_XAD_JF1M-DARK-BURST
      NEW: /gpfs/exfel/data/scratch/xcaltst/test/test/compare-h5-files/FXE_XAD_JF1M/FXE_XAD_JF1M-DARK-BURST
      const_BadPixelsDark_Jungfrau_M125.h5
        ~ CHANGED: data
      const_BadPixelsDark_Jungfrau_M233.h5
        ~ CHANGED: data
      const_Noise_Jungfrau_M125.h5 - ✓ no changes
      const_Noise_Jungfrau_M233.h5 - ✓ no changes
      const_Offset_Jungfrau_M125.h5 - ✓ no changes
      const_Offset_Jungfrau_M233.h5 - ✓ no changes
      Edited by Thomas Kluyver
    • Would it be possible to use the LOGGER instead of printing? By logging these at error level, one can search for FAILED and find them faster.

    • It's certainly possible, but for testing I actually prefer printed output. By default, pytest collects printed output per test and shows it only if the test fails, so there's less output to look through when just one or two tests are failing.

      Relying on that, you can find these blocks by searching for "Compare HDF5" or "REF:".

      But you use this testing machinery more often than I do, so if you still prefer it to log the output, let me know.
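
For illustration, pytest's capturing means a test like this stays silent when it passes, and the printed block appears in the report only on failure (a toy sketch, not the real test code):

```python
def test_compare_h5_files_example():
    # pytest captures this output; it is shown in the report only if the
    # assertion below fails (or when running with `pytest -s`)
    print("--- Compare HDF5 files ----")
    differences = []  # imagine this filled by the comparison code
    assert not differences, f"Differences found: {differences}"
```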

      Edited by Thomas Kluyver
  • Thomas Kluyver added 3 commits

  • Thomas Kluyver added 2 commits

  • Thomas Kluyver marked this merge request as ready

  • Thomas Kluyver changed title from Test/compare h5 files to [Tests] clearer comparison of HDF5 files

  • Thomas Kluyver changed the description

  • This aims to show all the differences found, not only the first one (i.e. not asserting on each separate step), to present the results in a clearer format, and to simplify the code doing the comparisons.

    Thank you for this MR. I will have a look as soon as possible. I think I did it in this way to save a bit of time in checking files. But what you did has no performance drawbacks. Excited to review :)

  • Thomas Kluyver added 1 commit

    • 1d55b146 - Move assert into test function, smaller stack trace on failure

    • I reran the tests for all reference runs, and they failed for different reasons. You might have already realized this; I just added a comment in case GitLab didn't notify you.

    • Thanks, Gitlab didn't notify me about the test run.

      Also, have you found a way to re-run the tests with different parameters? I went looking for that, but once I'd run it on a given commit, I could only see how to retry with the same parameters.

    • So I found this hack by experience, but I can't explain why it works here and not elsewhere.

      You go to pipelines and you hit retry on the last pipeline.

      (Screenshot: Screenshot_from_2023-11-30_09-49-35)

      All tests will be restarted, and you can configure the manual test again with the new configuration, as if you had a fresh pipeline after a new commit.

      Edited by Karim Ahmed
    • Aha, thanks, that makes some sense. I had been retrying the job rather than the pipeline.

  • Thomas Kluyver added 1 commit

    • 303cb726 - Only use equal_nan=True to compare float datasets
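
The reason for restricting equal_nan=True to floats: np.array_equal(..., equal_nan=True) uses np.isnan internally, which is not defined for string dtypes such as the byte strings stored in these HDF5 files (and integer data cannot contain NaN anyway). A sketch of the guard, with a hypothetical helper name:

```python
import numpy as np

def datasets_equal(a, b):
    # equal_nan=True makes NaN compare equal to NaN, but np.isnan (used
    # internally) fails for string dtypes, so only pass it for float data
    if np.issubdtype(np.asarray(a).dtype, np.floating):
        return np.array_equal(a, b, equal_nan=True)
    return np.array_equal(a, b)
```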

    • We're picking up some changes in constants created from AGIPD darks, e.g. looking at one such file:

      • file_loc: proposal:900203 runs:9013 9012 9011 -> proposal:900203 runs:9011 9012 9013 (different ordering)
      • creation_time: 2021-07-30T19:23:18.872347+00:00 -> 2021-07-30T19:25:51.826227+00:00

      We added sorting of the run numbers recently, but the file_loc is constructed with the low-med-high run numbers before they're sorted, and creation_time from run_high before sorting. I'm guessing we changed the order of runs in the tests to ensure the sorting worked, and the reference data was generated before that.

      It also seems to be prone to running out of RAM while loading the arrays to compare, so I probably need to be smarter about that.

      • file_loc: proposal:900203 runs:9013 9012 9011 -> proposal:900203 runs:9011 9012 9013 (different ordering)
      • creation_time: 2021-07-30T19:23:18.872347+00:00 -> 2021-07-30T19:25:51.826227+00:00

      I am wondering now why this didn't happen before your changes. Have you seen any bugs in this part?

      I prefer fixing this in the AGIPD dark notebook. We should generate the same constants and the order shouldn't matter. I will fix this separately.

    • Yeah, I'm not sure why it didn't fail before.

      We should generate the same constants and the order shouldn't matter. I will fix this separately.

      Agreed, thanks :thumbsup:

  • Thomas Kluyver added 1 commit

    • b0f1c5f0 - Avoid loading large arrays into memory in one go

  • Karim Ahmed mentioned in merge request !938 (closed)

    mentioned in merge request !938 (closed)

  • Thomas Kluyver added 3 commits

    • bd69c051 - Silence some line-length warnings
    • 4328915b - Handle datasets not in chunked layout
    • 83ac2288 - Fix comparison of strings in HDF5 files

  • I've fixed it so the comparison doesn't require lots of RAM (not loading entire datasets), and made it show some more information. In particular, where datasets with a single value have changed, we show the old & new values:

    --- Compare HDF5 files  ----
    REF: /gpfs/exfel/data/scratch/xcaltst/test/reference_folder/SPB_DET_AGIPD1M-1/SPB_DET_AGIPD1M-1-DARK-FIXED
    NEW: /gpfs/exfel/data/scratch/xcaltst/test/test/compare-h5-files/SPB_DET_AGIPD1M-1/SPB_DET_AGIPD1M-1-DARK-FIXED
    const_BadPixelsDark_AGIPD_SIV1_AGIPDV11_M215.h5
      ~ CHANGED: creation_time (Value: b'2021-07-30T19:23:18.872347+00:00' -> b'2021-07-30T19:25:51.826227+00:00')
      ~ CHANGED: file_loc (Value: b'proposal:900203 runs:9013 9012 9011' -> b'proposal:900203 runs:9011 9012 9013')
    const_BadPixelsDark_AGIPD_SIV1_AGIPDV11_M234.h5
      ~ CHANGED: creation_time (Value: b'2021-07-30T19:23:18.872347+00:00' -> b'2021-07-30T19:25:51.826227+00:00')
      ~ CHANGED: file_loc (Value: b'proposal:900203 runs:9013 9012 9011' -> b'proposal:900203 runs:9011 9012 9013')
    • I am not sure if you mentioned this anywhere, but this MR improves the performance of the pipeline considerably. Looking quickly at some other MRs where all of the reference runs are tested, I see around 2.5 hours of improvement :confetti_ball:

      Thanks again, Thomas! Besides the few comments, LGTM!

      Edited by Karim Ahmed
    • Excellent! I thought there might be a performance benefit, but I hadn't got round to checking.

      The previous code was using a thread pool, but unfortunately this doesn't help with HDF5. HDF5 is thread safe, but it works by putting a big lock round the whole library, so only one thread can be making HDF5 calls at a time, even if they're working on different files. So you normally need multiple processes to speed up accessing HDF5 files.

      (HDF group's answer is usually to use 'parallel HDF5', i.e. MPI. That works, even from Python, but it's more complex.)

  • Thomas Kluyver added 3 commits

    • d61bc8c9 - Clean up removed parameter from docstring
    • ada36c34 - Remove unused test_key parameter
    • 98cea14d - Reformat long line
