Sanitize cal_tools.tools.get_creation_date
This is just a little bit of maintenance, as I'm wading through the code. This MR cleans up `get_dir_creation_date`.
It was tested so:

```python
In [1]: from cal_tools.tools import get_dir_creation_date

In [2]: folder = "/gpfs/exfel/exp/DETLAB/202031/p900172/raw"

In [3]: get_dir_creation_date(folder, 10)
Out[3]: datetime.datetime(2020, 7, 20, 12, 39, 20, 631633)

In [4]: get_dir_creation_date(folder, 4)  # no such run, None is returned

In [5]:
```
One question, maybe: should this function return `None` when the date cannot be extracted, as it does now, or should it raise? I'd prefer to raise an error, so that it's clear that this step failed, rather than failing later.
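To make the trade-off concrete, here is a minimal sketch of the raise-instead-of-`None` behaviour. The function name, the choice of `FileNotFoundError`, and the mtime fallback are my illustrative assumptions, not the MR's actual code:

```python
from datetime import datetime
from pathlib import Path


def get_dir_creation_date_strict(directory, run):
    """Sketch: raise instead of returning None when the run is missing.

    `directory` is the proposal's raw folder, `run` the run number.
    The exception type here is an assumption, not the MR's code.
    """
    rpath = Path(directory) / f"r{run:04d}"
    if not rpath.is_dir():
        # Fail loudly and early, naming the missing run,
        # instead of silently returning None
        raise FileNotFoundError(
            f"Run {run} not found in {directory}: cannot determine creation date"
        )
    # Illustrative fallback: the run directory's modification time
    return datetime.fromtimestamp(rpath.stat().st_mtime)
```

The advantage is that the caller gets a message naming the missing run at the point of failure, rather than a `TypeError` somewhere downstream when `None` is used as a datetime.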
- Resolved by Cyril Danilevski
> One question, maybe: should this function return `None` when the date could not be extracted, as it is now, or to raise? I'd prefer to raise an error, so that it's clear that this step failed, rather than failing later.

If the run is not found, it will (and should) fail in other parts of whichever notebook, as the raw data will not be available for any processing. So I, like you, prefer to raise an error for this step, with a proper message.
```diff
     :param tsdir: to get modification time of [directory]/[run]04.
     :param verbosity: Level of verbosity (0 - silent)
     :return: (datetime) modification time
     """
+    directory = Path(directory)
+    proposal = int(directory.parent.name[1:])

     try:
-        path_list = list(filter(None, directory.strip('/').split('/')))
-        proposal = int(path_list[-2][1:])
         run_info = get_run_info(proposal, run)
```

We have `begin_at` in the metadata saved at myMDC, yes.
But this part only works in production, as communication with `MetadataClient` needs the secrets written in the webservice YAML configuration.
We could then simply use `extra_data`.
How?
The new HDF5 files (since ~February) have a few more metadata entries; maybe they contain all the required information?
example:

```
[tmichela@max-exfl001]/gpfs/exfel/exp/SPB/202001/p002450/raw/r0122% h5glance RAW-R0122-DA04-S00000.h5 METADATA
RAW-R0122-DA04-S00000.h5/METADATA (3 attributes)
├creationDate       [ASCII str: 1]
├daqLibrary         [ASCII str: 1]
├dataFormatVersion  [ASCII str: 1]
├dataSources
│ ├dataSourceId     [ASCII str: 3]
│ ├deviceId         [ASCII str: 3]
│ └root             [ASCII str: 3]
├karaboFramework    [ASCII str: 1]
├proposalNumber     [uint32: 1]
├runNumber          [uint32: 1]
├sequenceNumber     [uint32: 1]
└updateDate         [ASCII str: 1]
```
So would this have the same `creationDate` as `MetadataClient`, or does it just show the last time the file was modified?
Oh, I didn't know that. That is very useful.
Ok, I'm cool with using the `creationDate` from the files, but there will still be a loop, as not all files are created at once:

```python
In [1]: import pathlib, h5py

In [2]: p = pathlib.Path('/gpfs/exfel/d/raw/DETLAB/202031/p900172/r0072')

In [3]: for f in p.iterdir():
   ...:     print(f)
   ...:
RAW-R0072-AGIPD07-S00000.h5
RAW-R0072-AGIPD04-S00001.h5
RAW-R0072-AGIPD05-S00001.h5
RAW-R0072-AGIPD01-S00000.h5
RAW-R0072-AGIPD05-S00000.h5
RAW-R0072-AGIPD02-S00000.h5
RAW-R0072-AGIPD06-S00000.h5
RAW-R0072-AGIPD00-S00000.h5
RAW-R0072-AGIPD04-S00000.h5
RAW-R0072-AGIPD03-S00000.h5
RAW-R0072-DA01-S00000.h5
RAW-R0072-AGIPD00-S00001.h5
RAW-R0072-AGIPD03-S00001.h5
RAW-R0072-AGIPD07-S00001.h5
RAW-R0072-AGIPD01-S00001.h5
RAW-R0072-AGIPD06-S00001.h5
RAW-R0072-AGIPD02-S00001.h5

In [4]: dates = []
   ...: for f in p.iterdir():
   ...:     with h5py.File(f) as fin:
   ...:         dates.append(fin['METADATA/creationDate'][0])

In [5]: dates
Out[5]:
[b'20200904T101712Z', b'20200904T101737Z', b'20200904T101737Z',
 b'20200904T101712Z', b'20200904T101712Z', b'20200904T101712Z',
 b'20200904T101712Z', b'20200904T101712Z', b'20200904T101712Z',
 b'20200904T101712Z', b'20200904T101715Z', b'20200904T101737Z',
 b'20200904T101738Z', b'20200904T101737Z', b'20200904T101737Z',
 b'20200904T101737Z', b'20200904T101738Z']
```
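Since the files are written in batches, the run's creation date would presumably be the earliest timestamp across all files. A small sketch of parsing those byte strings (format `%Y%m%dT%H%M%SZ`, matching the output above) and taking the minimum:

```python
from datetime import datetime

# Timestamps as read from METADATA/creationDate (byte strings, a few
# representative values from the listing above)
dates = [b'20200904T101737Z', b'20200904T101712Z', b'20200904T101715Z']

# Parse each value and take the earliest: the run effectively started
# when the first file was created
parsed = [datetime.strptime(d.decode(), '%Y%m%dT%H%M%SZ') for d in dates]
creation_date = min(parsed)
print(creation_date)  # 2020-09-04 10:17:12
```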
Just to link things up, here's the issue for this in EXtra-data: https://github.com/European-XFEL/EXtra-data/issues/27
It's also partly stuck on a similar question: how do we efficiently present metadata in 200 separate files as metadata for a run?
- Resolved by Cyril Danilevski
I've updated the code with your suggestions.
I took the liberty to remove thetsdir
argument, after checking that it isn't used anywhere. It can be removed, since we never check when the file were added to disk, but rather use the newMETADATA/creationDate
.I've also added a test. It's not much, but it's honest.
> One question, maybe: should this function return `None` when the date could not be extracted, as it is now, or to raise?

This also ties into a larger question we need to tackle at some point (not in this PR): exceptions when running the notebooks through Slurm are currently ignored. It will try to run the rest of the cells, and the job will finish successfully no matter what happened, except if it hangs.
There are apparently some situations where this is desirable, so we can't just turn off ignoring errors. But we should at some point define which specific failures are safe to ignore, catch those exceptions, and then re-enable other exceptions.
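A rough sketch of what "define which failures are safe to ignore" could look like. Everything here (`SAFE_TO_IGNORE`, `run_cell`) is illustrative, not the actual xfel-calibrate machinery:

```python
# Hypothetical whitelist of exception types whose failures are known-safe
SAFE_TO_IGNORE = (FileNotFoundError,)


def run_cell(cell):
    """Run one notebook cell callable with selective error handling."""
    try:
        cell()
    except SAFE_TO_IGNORE as exc:
        # Known-safe failure: log it and carry on with the remaining cells
        print(f"Ignored known-safe failure: {exc!r}")
    # Any other exception propagates, so the Slurm job fails visibly
```

This inverts the current behaviour: instead of ignoring everything, only the whitelisted exceptions are swallowed, and anything unexpected still fails the job.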
added 2 commits
- Resolved by Cyril Danilevski
@danilevc, I have checked some old data from 2019 and it seems that `METADATA/creationDate` is not available. Maybe we bring the `tsdir` argument back for old data, in that case.

`/gpfs/exfel/exp/SQS/201930/p900075/raw/r0365/RAW-R0365-PNCCD01-S00001.h5`
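For old data like this, a possible per-file fallback could be sketched as follows: prefer `METADATA/creationDate` when the dataset exists, otherwise use the file's timestamp on disk. The helper name and the mtime choice are my assumptions, not the MR's implementation:

```python
from datetime import datetime
from pathlib import Path

import h5py  # as used in the snippets above


def file_creation_date(path):
    """Return a creation date for one RAW file (illustrative helper).

    Prefer METADATA/creationDate (present since ~Feb 2020); for older
    files fall back to the file's modification time on disk.
    """
    with h5py.File(path, 'r') as f:
        if 'METADATA/creationDate' in f:
            raw = f['METADATA/creationDate'][0]
            return datetime.strptime(raw.decode(), '%Y%m%dT%H%M%SZ')
    # Old data (e.g. 2019 runs): no metadata entry, use the filesystem
    return datetime.fromtimestamp(Path(path).stat().st_mtime)
```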
added 1 commit
- 985362db - Add support for older datasets in get_dir_creation_date
added 1 commit
- 7a862108 - Silently retrieve the creation date from directory
We've had a discussion where Karim highlighted that, as the calibration pipeline is launched from a web interface, nobody will ever set the `use_dir` flag. As such, we implicitly use the directory's creation date if the information is available neither from MyMDC nor from the dataset, as might be the case for data prior to 2020.

mentioned in commit e235e1e4
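Putting the thread together, the resolved behaviour can be sketched as a lookup cascade. The function name and signature below are illustrative, not the actual `cal_tools` API; step 2 is abbreviated to a comment since the file-scanning loop is shown earlier in the thread:

```python
from datetime import datetime
from pathlib import Path


def creation_date_cascade(directory, run, mymdc_lookup=None):
    """Illustrative lookup order settled on in this thread:

    1. MyMDC's begin_at (production only: needs the webservice secrets)
    2. METADATA/creationDate from the run's HDF5 files (data since ~Feb 2020)
    3. Silent fallback: the run directory's own timestamp (pre-2020 data)
    """
    rpath = Path(directory) / f"r{run:04d}"
    if mymdc_lookup is not None:
        try:
            return mymdc_lookup(run)
        except Exception:
            pass  # MyMDC unreachable outside production: fall through
    # Step 2 (scanning *.h5 METADATA/creationDate and taking the earliest)
    # omitted for brevity; see the loop earlier in the thread.
    # Step 3: silent filesystem fallback for old data
    return datetime.fromtimestamp(rpath.stat().st_mtime)
```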