
Reproducibility, step 1

Merged Thomas Kluyver requested to merge feat/reproducibility into master

Description

This is meant to be a first step towards reproducible calibration. Specifically, given the work directory (containing the parameterised ipynb files) from a previous calibration run, we can re-run the notebooks with a specified target Python, optionally directing output to a new location. I think this will also be useful for testing, because it will be easy to prepare the plan of what work is to be done without launching it.

There are two main limitations in terms of reproducibility, which I hope to address in later MRs:

  • Package versions are recorded (pip freeze), but for now you have to create a suitable environment manually, and tell the 'repeat' machinery to use it.
  • We can run the same notebooks with the same parameters, but they may use changing external resources like the calibration database. This will need to be addressed within the notebooks.

Open questions:

  • Naming, of course:
    • The JobArgs, Step, JobGroup classes for describing different levels of work to be done.
    • I've called the module 'repeat'. But if there's an option like --prepare to do everything up to submitting the jobs, then you could be using the same thing to run it for the first time.
  • What information should be stored in what files & formats?
    • I added a new steps.json describing how to run the notebooks in the directory.
    • Python packages are listed in requirements.txt
    • It also uses some info from calibration_metadata.yml
      • This has a mixture of information from the machinery (like the parameters) and from the specific pipeline (like calibration constants) - should they be split up?
      • It's saved both in the work directory (where the .ipynb files go) and the output directory - what if they get out of sync?
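For illustration, the kind of plan steps.json could record might look like the sketch below. The field names and layout here are assumptions made up for the example, not the actual schema written by this MR.

```python
import json

# Hypothetical sketch only -- the real steps.json schema may differ.
# The idea: each step records a notebook to execute, plus dependency
# information, so the same plan can be re-run later.
plan = {
    "python": "/usr/bin/python3",  # recorded target interpreter
    "steps": [
        {"notebook": "correct__0.ipynb", "after": []},
        {"notebook": "correct__1.ipynb", "after": []},
    ],
}

# Round-trip through JSON to show the plan is plain, serialisable data
loaded = json.loads(json.dumps(plan, indent=2))
print(len(loaded["steps"]))
```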

How Has This Been Tested?

Running commands on Maxwell. Run a correction:

xfel-calibrate AGIPD CORRECT \
    --in-folder /gpfs/exfel/exp/SPB/202131/p900230/raw --run 174 \
    --out-folder /gpfs/exfel/data/scratch/kluyvert/agipd-calib-900230-174-repro \
    --karabo-id SPB_DET_AGIPD1M-1 --karabo-id-control SPB_IRU_AGIPD1M1 --karabo-da-control AGIPD1MCTRL00 \
    --modules 0-4

Repeat it:

python3 -m xfel_calibrate.repeat \
    /gpfs/exfel/data/scratch/kluyvert/agipd-calib-900230-174-repro/slurm_out_AGIPDOfflineCorrection \
    --out-folder /gpfs/exfel/data/scratch/kluyvert/agipd-calib-900230-174-repro2

Types of changes

  • New feature (non-breaking change which adds functionality)
  • Refactor (refactoring code with no functionality changes)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.

Reviewers

@calibration

Edited by Thomas Kluyver

Merge request reports


Activity

  • Robert Rosca
  • I've only read through the code; I'll test this after the PANOSC meeting and after lunch (so ~2pm), and might leave more comments.

    Could you add a few lines to the README explaining what this does, and the idea behind it?

  • Thomas Kluyver added 1 commit

    • 702f1e0b - Apply suggestion to src/xfel_calibrate/calibrate.py


  • Thomas Kluyver added 1 commit

    • 2d2ad0ef - Update README about running & re-running calibration


  • I've updated the README for this - and also updated the information about running xfel-calibrate in general.

  • Robert Rosca resolved all threads

  • Karim Ahmed
  • Code context:

        def update_notebooks_params(working_dir: Path, new_values):
            """Modify parameters in copied notebooks"""
            for nb_file in working_dir.glob('*.ipynb'):
                nb = nbformat.read(nb_file, as_version=4)
                params = extract_parameters(nb, lang='python')
                params = parameter_values(params, **new_values)
                new_nb = replace_definitions(nb, params, execute=False, lang='python')
                nbformat.write(new_nb, nb_file)

        def new_report_path(old_out_folder, old_report_path, new_out_folder):
            """Update report path if it was in the old output folder"""
            try:
                report_in_output = Path(old_report_path).relative_to(old_out_folder)
            except ValueError:
                return old_report_path
    • Wouldn't the report be overwritten this way if it was not in the out_folder?

      Maybe adding a timestamp to the report name can be moved from the webservice to calibrate.py, so that we always attach the timestamp of when the correction was triggered (first run or a repeat) to the report name.

      Edited by Karim Ahmed
    • Or maybe change the report path base to the working_dir during repeat.

    • Yes, I think you're right. The reports have a timestamp in the filename, but that's set at a higher level and is just a fixed path as far as xfel-calibrate is concerned, so repeat would use the same path.

      We'll need to work this out, but I think the main focus for reproducing will be correction, where we normally put the report in the output directory.

    • I've opened an issue (#57) to try to help remember this.
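    To make the behaviour this thread is about concrete, new_report_path (copied from the diff in this MR) can be exercised standalone on the two cases:

```python
from pathlib import Path

def new_report_path(old_out_folder, old_report_path, new_out_folder):
    """Update report path if it was in the old output folder"""
    try:
        report_in_output = Path(old_report_path).relative_to(old_out_folder)
    except ValueError:
        # Report outside the old out-folder: the path comes back
        # unchanged, so a repeat would write to the same location.
        return old_report_path
    else:
        return str(Path(new_out_folder, report_in_output))

# A report inside the old out-folder is relocated to the new one:
print(new_report_path('/old/out', '/old/out/report.pdf', '/new/out'))
# A report anywhere else keeps its original path:
print(new_report_path('/old/out', '/srv/reports/report.pdf', '/new/out'))
```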

  • Karim Ahmed
  • Code context:

        def new_report_path(old_out_folder, old_report_path, new_out_folder):
            """Update report path if it was in the old output folder"""
            try:
                report_in_output = Path(old_report_path).relative_to(old_out_folder)
            except ValueError:
                return old_report_path
            else:
                return str(Path(new_out_folder, report_in_output))

        def main(argv=None):
            ap = argparse.ArgumentParser()
            ap.add_argument("from_dir", type=Path, help="A directory containing steps.json")
            ap.add_argument("--python", help="Path to Python executable to run notebooks")
            ap.add_argument("--out-folder", help="Directory to put output data")
            ap.add_argument("--slurm-partition", help="Submit jobs in this Slurm partition")
    • Would it make sense to have only a --slurm-partition argument and remove --no-cluster-job?
      That way, the notebooks could be run sequentially if --slurm-partition is not specified. If it is, jobs could run on the specified or some default partition.

    • It would be neat to have one fewer option, but I think that running through Slurm is the normal use case, so I'd like that to be as easy as possible. We also don't want to make it easy to accidentally start running lots of work on a login node.

      Maybe I'm biased by testing with the relatively heavy AGIPD correction, though? Are there other use cases where we run with --no-cluster-job often?

  • Cyril Danilevski
  • Karim Ahmed
  • 1113 1225 if any(j is not None for j in joblist):
    1114 1226 print("Submitted the following SLURM jobs: {}".format(",".join(joblist)))
    • I can see that there is duplication of this final part of the code, for example here and in repeat.py. Maybe this can be avoided?

    • Possibly, though I think it's not easy to factor out much code without making a function with a dozen parameters, which I think is an antipattern. I'll take a look.

  • Karim Ahmed
  • Karim and I have now had a look at the MR in more detail.

    We tried running recalibration on some Epix data at MID. It ran fine, no particular comments on that.
    We also re-calibrated the output of a previous re-calibration, and the outcome was the same, as expected.

    The decoupling of Slurm from pycalibration was only glanced over, as it looks more like its own feature.
    We observed a couple of issues that Karim will document below.

    We liked the decoupling of the preparation and execution of notebooks (with --prepare-only).
    Whilst at first glance it looks like a feature for power users, we like that it may eventually lead to a unified way to start jobs, whether fresh calibration or recalibration.

    However, as Karim documented, and you commented on, requirements.txt needs a bit of work.

    There are two questions that need clarification.
    The recalibration assumes that the whole slurm_out folder will always be available.
    I think that's the policy, but we should perhaps confirm that with ITDM, to make sure we're all on the same page.

    Similarly, when updating notebooks, should pre-notebooks be ignored? In the case of AGIPD correction, the pre-notebook is used to retrieve constants.
    It would be nice to reuse these constants (as some other constant might be injected at a later time).
    As with the slurm folder, there is an implicit agreement to keep the constants, but this should be clarified.

    Lastly, tangential to this, we have this run_calibrate.sh script that's created for every job. It's handy, and we think it should be kept, as it's easier to grok the command rather than checking/editing two files (steps.json and metadata.yaml). We should perhaps keep this in mind.
    We think keeping run_calibrate.sh is no big deal.

    Thanks!

  • Thanks for digging into this!

    The recalibration assumes that the whole slurm_out folder will always be available.
    I think that's the policy, but we should perhaps confirm that with ITDM, to make sure we're all on the same page.

    Yup, that has been my assumption, but it definitely makes sense to discuss it with ITDM.

    Similarly, when updating notebooks, should pre-notebooks be ignored? In the case of AGIPD correction, the pre-notebook is used to retrieve constants.
    It would be nice to reuse these constants (as some other constant might be injected at a later time.)

    We discussed at some point whether to make looking up constants a separate step that runs before the reproducible part of the pipeline. That's arguably neater, but it would add more complexity - the xfel-calibrate machinery would have to understand more about the code in the notebooks that it's executing.

    Our conclusion was that for now, we should do this ad-hoc, i.e. the code retrieving the constants should check the YAML file first, and only query the database if there aren't constants already identified. I plan to do that in a separate PR after this one. We could always move to a more formal separation of finding constants later on if we need to.
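    The ad-hoc approach described here might look roughly like the sketch below. All the names ("retrieved-constants", query_cal_db, get_constants) are invented for illustration, not the real pycalibration API:

```python
# Hypothetical sketch of "check the YAML file first, only query the
# database if constants aren't already identified". Names are made up.

def query_cal_db(karabo_da):
    # Stand-in for a real calibration database lookup
    return {"Offset": "ccv-from-db"}

def get_constants(metadata, karabo_da):
    # metadata: dict parsed from calibration_metadata.yml
    cached = metadata.get("retrieved-constants", {}).get(karabo_da)
    if cached is not None:
        return cached  # repeat run: reuse recorded constants
    return query_cal_db(karabo_da)  # first run: ask the database

meta = {"retrieved-constants": {"AGIPD00": {"Offset": "ccv-2021-05"}}}
print(get_constants(meta, "AGIPD00")["Offset"])  # ccv-2021-05
print(get_constants(meta, "AGIPD01")["Offset"])  # ccv-from-db
```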

    As with the slurm folder, there is an implicit agreement to keep the constants, but this should be clarified.

    :thumbsup:

    Lastly, tangential to this, we have this run_calibrate.sh script that's created for every job. It's handy, and we think it should be kept, as it's easier to grok the command rather than checking/editing two files (steps.json and metadata.yaml.) We should perhaps keep this in mind.

    run_calibrate.sh records the xfel-calibrate command used to set up the calibration initially. We could also write a similar script with the 'repeat' command to (re-)run the calibration from this working directory.

    I struggle with the naming for things like this. I think 'repeat' is clear if it ran before, but if you use the xfel-calibrate --prepare-only option to set up this directory, then you're repeat-ing something that didn't happen yet, which is confusing. But 'run' and 'execute' don't seem very specific. :confused:
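    Whatever name wins, writing such a script could be as simple as the sketch below; run_repeat.sh and write_repeat_script are made-up names for illustration, not part of this MR:

```python
import shlex
import sys
import tempfile
from pathlib import Path

def write_repeat_script(work_dir):
    """Sketch: record the command that re-runs calibration from this
    working directory, analogous to how run_calibrate.sh records the
    original xfel-calibrate command."""
    work_dir = Path(work_dir)
    cmd = [sys.executable, "-m", "xfel_calibrate.repeat", str(work_dir)]
    script = work_dir / "run_repeat.sh"  # hypothetical file name
    script.write_text("#!/bin/bash\n" + " ".join(shlex.quote(c) for c in cmd) + "\n")
    script.chmod(0o755)
    return script

with tempfile.TemporaryDirectory() as d:
    script_text = write_repeat_script(d).read_text()

print(script_text.splitlines()[0])  # #!/bin/bash
```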

  • Thomas Kluyver added 1 commit

    • 9c29fcff - Copy bash script to run notebook to working directory


  • It occurred to me that I should copy the run_calibrate.sh script which is submitted to Slurm to the working directory to be preserved alongside the notebooks it runs. If for example we manage to get rid of ipyparallel, we'll want to use the old version of the script (which can spin up ipyparallel workers) alongside the old jobs that still expect that.

    I've made this change and tried it with xfel-calibrate ... --prepare-only and python -m xfel_calibrate.repeat .

  • I'm thinking of building the calibration_metadata.yml file in the working directory (alongside the parameterised notebooks, Slurm .out files, etc.), and then copying it into the output directory as part of finalize.py. This should mean we never read information from the output directory in the calibration machinery.

  • Hmmm, on second thoughts, that probably doesn't work for running the notebooks interactively. I'll need to think more about that.

  • Thomas Kluyver mentioned in merge request !563 (merged)

  • Thomas Kluyver changed the description

  • Thomas Kluyver added 2 commits

    • f010451e - Ignore existing calibration_metadata.yml in xfel-calibrate
    • 437a9615 - Add xfel-calibrate --constants-from option to reuse constant info from a calibration_metadata.yml


  • Thomas Kluyver added 124 commits

    • 437a9615...119a1ff5 - 99 commits from branch master
    • 9b6e0cf1 - Initial prototype of describing jobs for reproducing work
    • 171f495d - Fix pip command
    • d2fce235 - after_ok and after_any are optional
    • 8bf6fb7e - Use CWD for notebook paths
    • 4e26f4df - Make class to store Slurm related options
    • 61688e6d - Initial work towards re-running calibration job
    • bb3a6a25 - Run finalize step after repeating task
    • bc4c721d - Fix getting report path from metadata
    • 6d307918 - Allow repeating calibration with a different output folder
    • 0d5cd44a - Ensure out-folder is created as a directory
    • c4be4941 - Update report path when it's in output path
    • 0f55c690 - Rethink what information goes in which file
    • 03fa7969 - Fix tests
    • fb9f928d - Mock 'python --version' command
    • 0afe5941 - Save string path to python instead of Path object in YAML
    • f9c00f3e - Fix user venv test
    • cfa9e8ab - Add --prepare-only option for xfel-calibrate
    • b2d76a6b - Copy calibration_metadata.yml when --prepare-only is used
    • 4d35534c - Need a Path object, not a str
    • 7d03a959 - Apply suggestion to src/xfel_calibrate/calibrate.py
    • cd10351a - Update README about running & re-running calibration
    • e9bdc837 - Copy bash script to run notebook to working directory
    • d356d0e8 - Ignore existing calibration_metadata.yml in xfel-calibrate
    • 172adc2f - Add xfel-calibrate --constants-from option to reuse constant info from a calibration_metadata.yml
    • 2c4dd373 - Mention new= in CalibrationMetadata docstring


  • Thomas Kluyver added 1 commit

    • 57cc19e0 - Update out-folder in calibration_metadata.yml on repeat


  • Thomas Kluyver changed the description

  • Karim Ahmed
  • This LGTM. One last comment is that adding docstrings to the new functions and classes would be nice.

  • Yup, good point.

  • Thomas Kluyver added 1 commit

    • 4e64ee3d - Rename JobGroup -> JobChain, add docstrings


  • I'll have another look at this after the standup :thumbsup:

  • Thomas Kluyver added 1 commit

    • 7b23765a - Clarify how dep_job_ids changes


  • Thomas Kluyver mentioned in commit 9fb48a30

  • Philipp Schmidt changed milestone to %3.4.3

  • Thomas Kluyver mentioned in merge request !602 (merged)
