
Sanitize cal_tools.tools.get_creation_date

Merged Cyril Danilevski requested to merge clean/get_dir_creation_date into master

This is just a little bit of maintenance, as I'm wading through the code.

This MR cleans up cal_tools.tools.get_dir_creation_date.

It was tested as follows:

In [1]: from cal_tools.tools import get_dir_creation_date

In [2]: folder = "/gpfs/exfel/exp/DETLAB/202031/p900172/raw"

In [3]: get_dir_creation_date(folder, 10)
Out[3]: datetime.datetime(2020, 7, 20, 12, 39, 20, 631633)

In [4]: get_dir_creation_date(folder, 4)  # no such run, None is returned

In [5]: 

One open question: should this function return None when the date cannot be extracted, as it does now, or should it raise?
I'd prefer to raise an error, so that it's clear that this step failed, rather than failing later.

@kamile @ahmedk

Merge request reports


Activity

  • One open question: should this function return None when the date cannot be extracted, as it does now, or should it raise?
    I'd prefer to raise an error, so that it's clear that this step failed, rather than failing later.

    If the run is not found, it will (and should) fail elsewhere in whichever notebook is run, as the raw data will not be available for any processing. So I, like you, would prefer to raise an error at this step with a proper message.
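To make the preference concrete, here is a minimal sketch of "raise with a proper message" instead of returning None. The function name, the ValueError type, and the fallback to the directory's mtime are all illustrative assumptions, not code from this MR:

```python
from datetime import datetime
from pathlib import Path


def creation_date_or_raise(directory, run):
    """Illustrative only: raise instead of silently returning None."""
    run_dir = Path(directory) / f"r{run:04d}"
    if not run_dir.is_dir():
        # Fail loudly here, rather than letting a later processing
        # step crash with a less obvious error.
        raise ValueError(
            f"Could not determine creation date: no such run directory {run_dir}"
        )
    # Placeholder date source: the directory's modification time.
    return datetime.fromtimestamp(run_dir.stat().st_mtime)
```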

  • Ebad Kamil @kamile started a thread on the diff
     :param tsdir: to get modification time of [directory]/[run]04.
     :param verbosity: Level of verbosity (0 - silent)
     :return: (datetime) modification time
    +
     """
    +directory = Path(directory)
    +proposal = int(directory.parent.name[1:])

     try:
    -    path_list = list(filter(None, directory.strip('/').split('/')))
    -    proposal = int(path_list[-2][1:])
         run_info = get_run_info(proposal, run)
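As a sanity check of the pathlib-based parsing introduced in this diff: for a raw folder like the one tested in the MR description, `directory.parent.name` is the proposal folder, so the proposal number falls out directly. A minimal standalone illustration:

```python
from pathlib import Path

directory = Path("/gpfs/exfel/exp/DETLAB/202031/p900172/raw")
# The parent of .../raw is the proposal folder, e.g. "p900172";
# stripping the leading "p" gives the numeric proposal id.
proposal = int(directory.parent.name[1:])
print(proposal)  # 900172
```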
    • this uses MetadataClient. Do we also have begin_at information somewhere in the metadata of the run? We could then simply use extra_data

    • we have begin_at in the metadata saved at myMDC, yes.

      But this part only works in production as communication with MetadataClient needs the secrets written in the webservice yaml configurations.

      We could then simply use extra_data

      How?

      Edited by Karim Ahmed
    •  from extra_data import open_run
       r = open_run(proposal, run)
       r.info()  # prints info about number of trains, start/stop time, sources, etc.
    • the new HDF5 files (since ~February) contain a few more metadata fields; maybe they have all the required information?

      example:

      [tmichela@max-exfl001]/gpfs/exfel/exp/SPB/202001/p002450/raw/r0122% h5glance RAW-R0122-DA04-S00000.h5 METADATA
      RAW-R0122-DA04-S00000.h5/METADATA (3 attributes)
      ├creationDate	[ASCII str: 1]
      ├daqLibrary	[ASCII str: 1]
      ├dataFormatVersion	[ASCII str: 1]
      ├dataSources
      │ ├dataSourceId	[ASCII str: 3]
      │ ├deviceId	[ASCII str: 3]
      │ └root	[ASCII str: 3]
      ├karaboFramework	[ASCII str: 1]
      ├proposalNumber	[uint32: 1]
      ├runNumber	[uint32: 1]
      ├sequenceNumber	[uint32: 1]
      └updateDate	[ASCII str: 1]
    • so would this have the same creationDate as MetadataClient, or does it just show the last time the file was modified?

      Edited by Karim Ahmed
    • the new HDF5 files (since ~February) contain a few more metadata fields; maybe they have all the required information?

      Oh, I didn't know that. That is very useful.

      Edited by Karim Ahmed
    • I'd expect the creationDate field to match the information from the metadata, but I'd check before :D

    • Is there an extra_data way to retrieve it, or is using h5py directly okay?

    • EXtra-data doesn't read these yet (it's on the todo list :) so h5py would be the way to go at the moment

    • Ok, I'm cool to use the creationDate from the files, but there will still be a loop, as not all files are created at once:

      In [1]: import pathlib, h5py
      
      In [2]: p = pathlib.Path('/gpfs/exfel/d/raw/DETLAB/202031/p900172/r0072')
      
      In [3]: for f in p.iterdir():
         ...:     print(f)
         ...: 
      RAW-R0072-AGIPD07-S00000.h5
      RAW-R0072-AGIPD04-S00001.h5
      RAW-R0072-AGIPD05-S00001.h5
      RAW-R0072-AGIPD01-S00000.h5
      RAW-R0072-AGIPD05-S00000.h5
      RAW-R0072-AGIPD02-S00000.h5
      RAW-R0072-AGIPD06-S00000.h5
      RAW-R0072-AGIPD00-S00000.h5
      RAW-R0072-AGIPD04-S00000.h5
      RAW-R0072-AGIPD03-S00000.h5
      RAW-R0072-DA01-S00000.h5
      RAW-R0072-AGIPD00-S00001.h5
      RAW-R0072-AGIPD03-S00001.h5
      RAW-R0072-AGIPD07-S00001.h5
      RAW-R0072-AGIPD01-S00001.h5
      RAW-R0072-AGIPD06-S00001.h5
      RAW-R0072-AGIPD02-S00001.h5
      
      In [4]: dates = []
         ...: for f in p.iterdir():
         ...:     with h5py.File(f) as fin:
         ...:         dates.append(fin['METADATA/creationDate'][0])
      
      In [5]: dates
      Out[5]: 
      [b'20200904T101712Z',
       b'20200904T101737Z',
       b'20200904T101737Z',
       b'20200904T101712Z',
       b'20200904T101712Z',
       b'20200904T101712Z',
       b'20200904T101712Z',
       b'20200904T101712Z',
       b'20200904T101712Z',
       b'20200904T101712Z',
       b'20200904T101715Z',
       b'20200904T101737Z',
       b'20200904T101738Z',
       b'20200904T101737Z',
       b'20200904T101737Z',
       b'20200904T101737Z',
       b'20200904T101738Z']
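The byte strings above can be parsed and reduced to a single date for the run. Taking the earliest file's timestamp as the run's creation date is an assumption here, not something decided in the MR:

```python
from datetime import datetime

# A subset of the values seen above, in the DAQ's compact ISO-like format.
dates = [b'20200904T101737Z', b'20200904T101712Z', b'20200904T101715Z']

# Decode the bytes, parse the timestamp, and take the earliest one.
parsed = [datetime.strptime(d.decode(), '%Y%m%dT%H%M%SZ') for d in dates]
run_date = min(parsed)
print(run_date)  # 2020-09-04 10:17:12
```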
    • Just to link things up, here's the issue for this in EXtra-data: https://github.com/European-XFEL/EXtra-data/issues/27

      It's also partly stuck on a similar question: how do we efficiently present metadata in 200 separate files as metadata for a run?

    • Resolved by Cyril Danilevski

      I've updated the code with your suggestions.
    I took the liberty of removing the tsdir argument, after checking that it isn't used anywhere. It can be removed, since we never check when the files were added to disk, but rather use the new METADATA/creationDate.

      I've also added a test. It's not much, but it's honest.

  • One question maybe is, should this function return None when the date could not be extracted, as it is now, or to raise?

    This also ties into a larger question we need to tackle at some point (not in this PR): exceptions when running the notebooks through Slurm are currently ignored. The runner will try to execute the rest of the cells, and the job will finish successfully no matter what happened, unless it hangs.

    There are apparently some situations where this is desirable, so we can't just turn off ignoring errors. But we should at some point define which specific failures are safe to ignore, catch those exceptions, and then re-enable other exceptions.
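A hypothetical sketch of what "define which specific failures are safe to ignore" could look like when executing notebook cells; the names and the choice of FileNotFoundError as a safe exception are purely illustrative:

```python
# Hypothetical whitelist: only these exception types are considered safe
# to ignore when a cell fails; anything else propagates and fails the job.
SAFE_TO_IGNORE = (FileNotFoundError,)


def run_cell(cell_fn):
    try:
        cell_fn()
    except SAFE_TO_IGNORE as exc:
        # Known-safe failure: log it and keep executing the notebook.
        print(f"Ignoring known-safe failure: {exc!r}")
    # All other exceptions propagate and fail the Slurm job.


def missing_file():
    raise FileNotFoundError("no RAW files yet")


run_cell(missing_file)  # logged and ignored, execution continues
```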

  • Cyril Danilevski added 2 commits

    • 90b592c6 - Sanitize cal_tools.tools.get_creation_date
    • 22f45273 - Add test to get_dir_creation_date

    Compare with previous version

  • Ebad Kamil
  • Adding a test, very nice.

  • added 1 commit

    • d64e76b5 - Remove unecessary assert from test

    Compare with previous version

  • If there are no further comments, I'll merge by the end of the week.

  • @danilevc, I have checked some old data from 2019 and it seems that METADATA/creationDate is not available. Maybe we should keep the tsdir argument as a fallback for old data in that case.

    /gpfs/exfel/exp/SQS/201930/p900075/raw/r0365/RAW-R0365-PNCCD01-S00001.h5

    Edited by Karim Ahmed
  • added 1 commit

    • 985362db - Add support for older datasets in get_dir_creation_date

    Compare with previous version

  • Nice catch! I've reinstated tsdir as use_dir.

  • added 1 commit

    • 7a862108 - Silently retrieve the creation date from directory

    Compare with previous version

  • We've had a discussion where Karim highlighted that, as the calibration pipeline is launched from a web interface, nobody will ever set the use_dir flag.
    As such, we implicitly fall back to the directory's creation date if the information is available neither from myMDC nor from the dataset, as may be the case for data prior to 2020.
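A minimal sketch of the fallback order described here (myMDC first, then file metadata, then the directory's timestamp). The helper names are hypothetical stand-ins, and the real myMDC and HDF5 access is elided; this is not the actual cal_tools implementation:

```python
from datetime import datetime, timezone
from pathlib import Path


def date_from_mymdc(proposal, run):
    # Hypothetical stand-in: the real code queries MetadataClient, which
    # only works in production where the webservice secrets are available.
    raise RuntimeError("myMDC not reachable")


def date_from_h5_metadata(run_dir):
    # Hypothetical stand-in: the real code would read METADATA/creationDate
    # from the run's HDF5 files (present since ~February 2020).
    raise RuntimeError("no METADATA/creationDate in old files")


def creation_date_with_fallback(run_dir, proposal=None, run=None):
    """Illustrative fallback chain, not the actual cal_tools code."""
    run_dir = Path(run_dir)
    try:
        return date_from_mymdc(proposal, run)
    except Exception:
        pass
    try:
        return date_from_h5_metadata(run_dir)
    except Exception:
        pass
    # Last resort, e.g. for pre-2020 data: the directory's own timestamp.
    return datetime.fromtimestamp(run_dir.stat().st_mtime, tz=timezone.utc)
```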

  • neat, thanks!

  • mentioned in commit e235e1e4
