Skip to content
Snippets Groups Projects

[webservice] Support requests to repeat correction from myMdC

Merged Thomas Kluyver requested to merge feat/webservice-repeat into master

Description

We have been working to enable reproducible correction. This aims to integrate it with the webservice, so deleted data can be recreated on demand by clicking a button. Changes to myMdC will be needed to actually send these requests, but for now the idea is to make them like the existing correction & dark processing requests (even if passing a Python list is ugly).

To do:

  • Pick somewhere for caching virtualenvs used in reproducibility and set that
  • Defer launching the jobs until after replying to ZMQ message?

How Has This Been Tested?

With the included request_repeat.py script on max-exfl017:

python3 request_repeat.py --endpoint tcp://max-exfl017:5555 --mymdc-id 89178 900113 9982

Types of changes

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the code style of this project.

Reviewers

Edited by Thomas Kluyver

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
933 934
934 935 return queued_msg.encode()
935 936
937 async def handle_repeat(self, rid, instrument, cycle, proposal, runnr):
938 request_time = datetime.now().strftime('%Y-%m-%dT%H:%M:%S')
939
940 with self.job_db:
941 cur = self.job_db.execute(
942 "INSERT INTO requests VALUES (NULL, ?, ?, ?, 'CORRECT', ?)",
  • For now, these are stored as correction requests in the database. We could make a new 'REPEAT' type. I think serve_overview is the only consumer of this column, so it's up to us how we use it.

  • changed this line in version 2 of the diff

  • How about RECORRECT? I'd like to somehow include this interface is only meant for corrections. It is clear the actual xfel-calibrate-repeat can work for any action.

  • I realised after writing the comment that of course the main thing using this field is the job monitoring service. Reusing 'CORRECT' here means it should automatically do the right thing in terms of updating the run in myMdC and sending Kafka notifications when corrections finish. Obviously we can make it work with a new value as well, but that involves more changes.

  • I see, thanks! Right now it seems indeed the job monitor should treat such runs identically, so this is fine. If there any discrepancies, they might turn up when we finalize how the repeat-corrections flow should work in myMDC.

    I'm realizing now the comment to use RECORRECT rather than REPEAT also fits the MR in general. Given the strict (read: limited) API I would not foresee to use this for anything other than corrections.

  • Please register or sign in to reply
  • 966
    967 logging.info("Found %d corrections to re-run for p%s r%s: %s",
    968 len(mddirs_by_krb_id), proposal, runnr, list(mddirs_by_krb_id))
    969
    970 for karabo_id, mddirs in mddirs_by_krb_id.items():
    971 # Select the latest metadata directory - with the highest request
    972 # time - to re-run for each detector (karabo_id)
    973 mddir, _ = max(mddirs, key=lambda p: p[1])
    974
    975 logging.info("Repeating correction for %s from %s", karabo_id, mddir)
    976
    977 cmd = ['python', '-m', 'xfel_calibrate.repeat', mddir]
    978
    979 with self.job_db:
    980 cur = self.job_db.execute(
    981 "INSERT INTO executions VALUES (NULL, ?, ?, NULL, ?, NULL)",
  • Thomas Kluyver added 2 commits

    added 2 commits

    • dd7e5433 - Restructure code to launch jobs after sending ZMQ reply
    • 852ac9e1 - Create virtualenvs for repeating correction under /scratch

    Compare with previous version

  • Thomas Kluyver marked the checklist item Defer launching the jobs until after replying to ZMQ message? as completed

    marked the checklist item Defer launching the jobs until after replying to ZMQ message? as completed

  • Thomas Kluyver marked the checklist item Pick somewhere for caching virtualenvs used in reproducibility and set that as completed

    marked the checklist item Pick somewhere for caching virtualenvs used in reproducibility and set that as completed

  • 990 logging.warning(msg)
    991 await update_mdc_status(self.mdc, 'correct', rid, msg)
    992 return
    993
    994 await update_mdc_status(self.mdc, 'correct', rid, queued_msg)
    995
    996 ret = []
    997 for karabo_id, mddirs in mddirs_by_krb_id.items():
    998 # Select the latest metadata directory - with the highest request
    999 # time - to re-run for each detector (karabo_id)
    1000 mddir, _ = max(mddirs, key=lambda p: p[1])
    1001
    1002 logging.info("Repeating correction for %s from %s", karabo_id, mddir)
    1003
    1004 cmd = ['python', '-m', 'xfel_calibrate.repeat', mddir,
    1005 '--env-cache', '/scratch/xcal/repeat-envs']
    • This is where virtualenvs will be created for repeating corrections. I went for /scratch because it has a fairly large amount of free space (32 GB on max-exfl016), and it doesn't matter if it gets cleaned up from time to time since it's just a cache. But there are other options, e.g. under xcal's home directory, where we deploy the 'current' offline calibration environment.

    • Maybe @jmalka has an opinion here?

      Using local storage is proably beneficial given environments and cluster filesystems don't make a particular good couple...

    • D'oh, obviously /scratch doesn't work: the correction jobs run on other nodes via Slurm, so the Python environment has to be on a shared filesystem.

    • changed this line in version 6 of the diff

    • The current revision puts environments in /gpfs/exfel/data/scratch/xcal/calib-repeat-envs (using the username it's running as).

    • Please register or sign in to reply
  • Thomas Kluyver added 2 commits

    added 2 commits

    • 179713a0 - Add script for testing repeat mechanism in webservice
    • 298850f3 - Fix getting reports directory

    Compare with previous version

    • (repeating this here for clearer readability)

      I think I would prefer to use the term RECORRECT rather than REPEAT here. While the machienry itself works for any xfel-calibrate action of course, the strict (read: limited) API only really works for corrections, so we might as well reflect that.

    • I've changed the action name (what myMdC will send us) to 'recorrect'. It's still 'CORRECT' in the internal SQLite DB, for simplicity.

    • Please register or sign in to reply
  • Thomas Kluyver added 1 commit

    added 1 commit

    • e66831f2 - Allow specifying myMdC ID in request_repeat script

    Compare with previous version

  • Thomas Kluyver added 2 commits

    added 2 commits

    • ac7aa753 - Actually use passed ID for myMdC
    • 8a109ab3 - Convert Path to str() for command arguments

    Compare with previous version

  • Thomas Kluyver added 1 commit

    added 1 commit

    • 4967a143 - Make environment in shared filesystem

    Compare with previous version

  • Thomas Kluyver added 1 commit

    added 1 commit

    • 4d2506af - Put reports from recorrection in proposal usr folder

    Compare with previous version

  • Thomas Kluyver marked this merge request as ready

    marked this merge request as ready

  • Thomas Kluyver changed the description

    changed the description

  • Thomas Kluyver added 1 commit

    added 1 commit

    • 20293030 - Change action name to recorrect

    Compare with previous version

  • Thomas Kluyver added 1 commit

    added 1 commit

    • d97fe8e9 - Make repeat parameter --report-to to match calibrate

    Compare with previous version

  • Philipp Schmidt changed milestone to %3.6.2

    changed milestone to %3.6.2

  • No comment from me, LGTM, thanks.

  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Please register or sign in to reply
    Loading