[Tests] clearer comparison of HDF5 files

Merged Thomas Kluyver requested to merge test/compare-h5-files into master

Working on !931 (merged), I found the comparison of HDF5 files against reference files was not very clear. This aims to show all the differences found, not only the first one (i.e. not asserting on each separate step), to present the results in a clearer format, and to simplify the code doing the comparisons.

Based on the branch in !931 (merged), because I know it fails the relevant tests, so I can see what the results look like.

cc @ahmedk

Edited by Thomas Kluyver

Activity

  • Output looks like this (updated after commit bcbfb11d):

      --- Compare HDF5 files  ----
      REF: /gpfs/exfel/data/scratch/xcaltst/test/reference_folder/FXE_XAD_JF1M/FXE_XAD_JF1M-DARK-BURST
      NEW: /gpfs/exfel/data/scratch/xcaltst/test/test/compare-h5-files/FXE_XAD_JF1M/FXE_XAD_JF1M-DARK-BURST
      const_BadPixelsDark_Jungfrau_M125.h5
        ~ CHANGED: data
      const_BadPixelsDark_Jungfrau_M233.h5
        ~ CHANGED: data
      const_Noise_Jungfrau_M125.h5 - ✓ no changes
      const_Noise_Jungfrau_M233.h5 - ✓ no changes
      const_Offset_Jungfrau_M125.h5 - ✓ no changes
      const_Offset_Jungfrau_M233.h5 - ✓ no changes
      Edited by Thomas Kluyver
    • Would it be possible to use the LOGGER instead of printing? By logging these at error level, one can search for FAILED and find them faster.

    • It's certainly possible, but for testing I actually prefer printed output. By default, pytest collects printed output per test and shows it only if the test fails, so there's less output to look through when just one or two tests are failing.

      Relying on that, you can find these blocks by searching for "Compare HDF5" or "REF:".

      But you use this testing machinery more often than I do, so if you still prefer it to log the output, let me know.
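
For illustration, pytest's capturing means a test like this stays silent when it passes, and the printed block appears in the report only on failure (a toy sketch, not the real test code):

```python
def test_compare_h5_files_example():
    # pytest captures this output; it is shown in the report only if the
    # assertion below fails (or when running with `pytest -s`)
    print("--- Compare HDF5 files ----")
    differences = []  # imagine this filled by the comparison code
    assert not differences, f"Differences found: {differences}"
```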

      Edited by Thomas Kluyver
  • Thomas Kluyver added 3 commits

  • Thomas Kluyver added 2 commits

  • Thomas Kluyver marked this merge request as ready

  • Thomas Kluyver changed title from Test/compare h5 files to [Tests] clearer comparison of HDF5 files

  • Thomas Kluyver changed the description

  • This aims to show all the differences found, not only the first one (i.e. not asserting on each separate step), to present the results in a clearer format, and to simplify the code doing the comparisons.

    Thank you for this MR. I will have a look as soon as possible. I think I did it in this way to save a bit of time in checking files. But what you did has no performance drawbacks. Excited to review :)

  • Thomas Kluyver added 1 commit

    • 1d55b146 - Move assert into test function, smaller stack trace on failure

    • I reran the tests for all reference runs, and they failed for different reasons. You might have already realized this; I just added a comment in case GitLab didn't notify you.

    • Thanks, Gitlab didn't notify me about the test run.

      Also, have you found a way to re-run the tests with different parameters? I went looking for that, but once I'd run it on a given commit, I could only see how to retry with the same parameters.

    • So I found this hack by experience, but I can't explain why it works here and not elsewhere.

      You go to pipelines and you hit retry on the last pipeline.

      (Screenshot: Screenshot_from_2023-11-30_09-49-35)

      All tests will be restarted, and you can configure the manual test again with the new configuration, as if you had a fresh pipeline after a new commit.

      Edited by Karim Ahmed
    • Aha, thanks, that makes some sense. I had been retrying the job rather than the pipeline.

  • Thomas Kluyver added 1 commit

    • 303cb726 - Only use equal_nan=True to compare float datasets
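
The reason for restricting equal_nan=True to floats: np.array_equal(..., equal_nan=True) uses np.isnan internally, which is not defined for string dtypes such as the byte strings stored in these HDF5 files (and integer data cannot contain NaN anyway). A sketch of the guard, with a hypothetical helper name:

```python
import numpy as np

def datasets_equal(a, b):
    # equal_nan=True makes NaN compare equal to NaN, but np.isnan (used
    # internally) fails for string dtypes, so only pass it for float data
    if np.issubdtype(np.asarray(a).dtype, np.floating):
        return np.array_equal(a, b, equal_nan=True)
    return np.array_equal(a, b)
```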

    • We're picking up some changes in constants created from AGIPD darks, e.g. looking at one such file:

      • file_loc: proposal:900203 runs:9013 9012 9011 -> proposal:900203 runs:9011 9012 9013 (different ordering)
      • creation_time: 2021-07-30T19:23:18.872347+00:00 -> 2021-07-30T19:25:51.826227+00:00

      We added sorting of the run numbers recently, but the file_loc is constructed with the low-med-high run numbers before they're sorted, and creation_time from run_high before sorting. I'm guessing we changed the order of runs in the tests to ensure the sorting worked, and the reference data was generated before that.

      It also seems to be prone to running out of RAM while loading the arrays to compare, so I probably need to be smarter about that.

      • file_loc: proposal:900203 runs:9013 9012 9011 -> proposal:900203 runs:9011 9012 9013 (different ordering)
      • creation_time: 2021-07-30T19:23:18.872347+00:00 -> 2021-07-30T19:25:51.826227+00:00

      I am wondering now why this didn't happen before your changes. Have you seen any bugs in this part?

      I prefer fixing this in the AGIPD dark notebook. We should generate the same constants and the order shouldn't matter. I will fix this separately.

    • Yeah, I'm not sure why it didn't fail before.

      We should generate the same constants and the order shouldn't matter. I will fix this separately.

      Agreed, thanks :thumbsup:

  • Thomas Kluyver added 1 commit

    • b0f1c5f0 - Avoid loading large arrays into memory in one go

  • Karim Ahmed mentioned in merge request !938 (closed)

    mentioned in merge request !938 (closed)

  • Thomas Kluyver added 3 commits

    • bd69c051 - Silence some line-length warnings
    • 4328915b - Handle datasets not in chunked layout
    • 83ac2288 - Fix comparison of strings in HDF5 files

  • I've fixed it so the comparison doesn't require lots of RAM (not loading entire datasets), and made it show some more information. In particular, where datasets with a single value have changed, we show the old & new values:

    --- Compare HDF5 files  ----
    REF: /gpfs/exfel/data/scratch/xcaltst/test/reference_folder/SPB_DET_AGIPD1M-1/SPB_DET_AGIPD1M-1-DARK-FIXED
    NEW: /gpfs/exfel/data/scratch/xcaltst/test/test/compare-h5-files/SPB_DET_AGIPD1M-1/SPB_DET_AGIPD1M-1-DARK-FIXED
    const_BadPixelsDark_AGIPD_SIV1_AGIPDV11_M215.h5
      ~ CHANGED: creation_time (Value: b'2021-07-30T19:23:18.872347+00:00' -> b'2021-07-30T19:25:51.826227+00:00')
      ~ CHANGED: file_loc (Value: b'proposal:900203 runs:9013 9012 9011' -> b'proposal:900203 runs:9011 9012 9013')
    const_BadPixelsDark_AGIPD_SIV1_AGIPDV11_M234.h5
      ~ CHANGED: creation_time (Value: b'2021-07-30T19:23:18.872347+00:00' -> b'2021-07-30T19:25:51.826227+00:00')
      ~ CHANGED: file_loc (Value: b'proposal:900203 runs:9013 9012 9011' -> b'proposal:900203 runs:9011 9012 9013')
    • I am not sure if you mentioned this anywhere, but this MR improves the performance of the pipeline considerably. Looking quickly at some other MRs where all of the reference runs are tested, I see around 2.5 hours of improvement :confetti_ball:

      Thanks again, Thomas! Besides the few comments, LGTM!

      Edited by Karim Ahmed
    • Excellent! I thought there might be a performance benefit, but I hadn't got round to checking.

      The previous code was using a thread pool, but unfortunately this doesn't help with HDF5. HDF5 is thread safe, but it works by putting a big lock round the whole library, so only one thread can be making HDF5 calls at a time, even if they're working on different files. So you normally need multiple processes to speed up accessing HDF5 files.

      (HDF group's answer is usually to use 'parallel HDF5', i.e. MPI. That works, even from Python, but it's more complex.)

  • Thomas Kluyver added 3 commits

    • d61bc8c9 - Clean up removed parameter from docstring
    • ada36c34 - Remove unused test_key parameter
    • 98cea14d - Reformat long line
