EXtra-data and xfel kernel
The former karabo-data package has been renamed to EXtra-data, and part of it has been split off into EXtra-geom.
- testing with DSSC analysis workflow
- testing with FastCCD analysis workflow
- use EXtra-geom for DSSC geometry
- in DSSC.py, load_geom(), the path is '/gpfs/exfel/sw/software/exfel_environments/misc/git/karabo_data/docs/dssc_geo_june19.h5'. This seems old; should it be updated?
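For reference, a minimal sketch of the rename on the data-access side, assuming RunDirectory-based access (the run path below is only a placeholder):

```python
# old: from karabo_data import RunDirectory
from extra_data import RunDirectory

run = RunDirectory('/gpfs/exfel/exp/SCS/201901/p002212/raw/r0125')  # placeholder path
run.info()  # same interface as the former karabo_data RunDirectory
```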
We should also switch to the xfel Python environment, but the netCDF4 package (used to save intermediate results such as dark-processed runs) is not available in that environment. I asked DA to include it: https://in.xfel.eu/redmine/issues/60757
added 2 commits
Regarding the path to the geometry file, there is a ticket: https://in.xfel.eu/redmine/issues/60716
```
❯ grep -n karabo *.py
bunch_pattern.py:17: https://git.xfel.eu/gitlab/karaboDevices/euxfel_bunch_pattern
bunch_pattern.py:22: runDir: karabo_data run directory. Required only if bp_table is None.
DSSC.py:243: path = '/gpfs/exfel/sw/software/exfel_environments/misc/git/karabo_data/docs/dssc_geo_june19.h5'
```
Remaining mentions of karabo_data are:
- in the description in bunch_pattern.py line 22. Maybe @mercadil you have a correction to propose?
- the DSSC geometry file path. We could add the file to the ToolBox. I guess at some point there will be a calibration database...
I'm not sure adding the file to the ToolBox is best: each user has a local copy of the ToolBox somewhere, but we need an absolute path to the file. How would we proceed if the current directory is changed within the notebook (%cd new_path)? For now I think we could update the path to the one in the EXtra-geom checkout: path = '/gpfs/exfel/sw/software/git/EXtra-geom/docs/dssc_geo_june19.h5'
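For context, a minimal sketch of what load_geom() could look like with EXtra-geom and that path; the quadrant positions below are placeholders, the actual values are the ones in DSSC.py:

```python
from extra_geom import DSSC_1MGeometry

# Placeholder quadrant positions (mm); the real values live in DSSC.py load_geom().
quad_pos = [(-130, 5), (-130, -125), (5, -125), (5, 5)]
path = '/gpfs/exfel/sw/software/git/EXtra-geom/docs/dssc_geo_june19.h5'
geom = DSSC_1MGeometry.from_h5_file_and_quad_positions(path, quad_pos)
```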
For bunch_pattern.py line 22, we can just change karabo_data to extra-data. This can be done at the same time as whatever we decide for the geometry file.
Using the notebook from https://in.xfel.eu/gitlab/SCS/ToolBox/merge_requests/45:
- dark multiprocessing works
- loading multiprocessed data and computing azimuthal scans works
- single processing run data works
- multiprocessing run data stops before running out of memory
Following https://in.xfel.eu/gitlab/SCS/ToolBox/merge_requests/61/diffs, I switched to joblib for multiprocessing. The immediate advantage is that I now see errors when the processing hangs.
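A rough sketch of the joblib pattern, assuming one task per DSSC module; the function body and run path are placeholders:

```python
from joblib import Parallel, delayed

def process_module(module_index, run_path):
    # Placeholder for loading and binning the data of one DSSC module.
    return module_index

run_path = '/gpfs/exfel/exp/SCS/201901/p002212/raw/r0125'  # placeholder

# Exceptions raised in a worker are re-raised in the parent process,
# so a failing worker shows an error instead of hanging silently.
results = Parallel(n_jobs=16)(
    delayed(process_module)(m, run_path) for m in range(16))
```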
After that change it turned out that reading the scan_variable file for the binning in each process conflicts with the xarray caching mechanism, so the scan_variable content is now passed to each process instead of the file name.
While investigating this problem I came across https://github.com/pydata/xarray/issues/3785, which shows that loading files with open_dataset leads to odd behavior when reloading modified files. We have observed this behavior in the past. It is not related to a caching mechanism but to the fact that the files are not closed. Using load_dataset instead, which closes the file immediately after loading the data, should solve these problems.
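A small sketch of the difference, assuming the binning variable is stored in scan_variable.nc (the file and variable names below are assumptions):

```python
import xarray as xr

# open_dataset() is lazy and keeps the file handle open, so a later rewrite of
# scan_variable.nc may not be visible when the file is read again:
# scan = xr.open_dataset('scan_variable.nc')

# load_dataset() reads everything into memory and closes the file immediately;
# the values can then be passed to the workers instead of the file name.
scan = xr.load_dataset('scan_variable.nc')
scan_values = scan['scan_variable'].values  # assumed variable name
```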
There is still an issue with importing netCDF4. I tried importing it in the individual Python files of the ToolBox, but that doesn't work. The only place that seems reliable is in the notebook, before importing the ToolBox.
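For the record, the workaround in the notebook currently looks roughly like this (the ToolBox import name is an assumption):

```python
import netCDF4    # imported first, before the ToolBox; importing it inside the ToolBox modules did not work
import ToolBox as tb  # assumed ToolBox import name
```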
added 1 commit
- 7ef020b1 - Load and close netcdf file to avoid unwanted caching behavior
The test notebook calculation completes, but the images have lots of new artifacts which I haven't seen before.
added 1 commit
- ce89f22a - Clean up remaining scan_variable.nc file saving code
added 1 commit
- d59aabb0 - Keep virtual dataset h5 files closed outside 'with' context
The calculation seems to depend on how many workers are used. With 8 workers, the dark and delay-scan binning results are similar to the master branch output, but the energy binning results are way off (values of 1e297...).
With 16 workers, I get an error in the groupby saying that the scan_variable is empty, which doesn't make sense...