Skip to content
Snippets Groups Projects

Stop execution on notebook errors

Merged Thomas Kluyver requested to merge on-error-give-up into master

Description

See calibration/planning#34. Trying to carry on executing a notebook after an uncaught exception rarely works, because the later code often depends on things which didn't happen (variables not defined, files not written, etc.). This stops notebook execution on uncaught errors, as is the norm for Python (and, I think, most languages with exceptions). This should make for clearer output when things do go wrong (because there aren't a load of meaningless extra errors after the real problem).

Slurm still shouldn't treat the jobs as failing, because bash does essentially the same thing by default: unless you use set -e, it will ignore failing commands and carry on to the next one. So reports should still be generated and files moved as expected. I plan to address that in a later step.

How Has This Been Tested?

Running xfel-calibrate at the command line. Output is in /gpfs/exfel/data/scratch/kluyvert/agipd-calib-900201-203-errstop.

Types of changes

  • Bug fix
  • Breaking change (potentially, if there are cases which rely on subsequent cells executing after an error)

Checklist:

  • My code follows the code style of this project.

Reviewers

@schmidtp @hammerd

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • So we can neatly test with this patch on Wednesday, but I'm wondering whether it makes sense to have it as a toggle-able option in the configuration. We don't have a proper way to set global defaults right now, but this might still allow us to change the behaviour quickly during operations. Alternatively, we might simply revert this change in these cases.

  • I agree that it's probably useful to be able to undo the change quickly if it causes a problem. Three options in order of increasing technical complexity:

    1. Just revert the change as a hotfix if needed: add --on-error-resume-next back to the script and deploy pycalibration again.
    2. Add an option like xfel-calibrate --ignore-nb-errors, which could be added to webservice config to go back globally. Change the YAML file, restart the webservice.
    3. Allow controlling it per instrument/proposal/detector via calibration_configurations. Use update_config.py to flip the switch as needed.

    I'm leaning towards 1. I think the main reason to add configuration options would be as a signal that we're serious about not breaking things - but so long as we can fix any issues quickly, I'm not sure anyone cares about how we do it. And if we add a config option, there's a risk that it becomes basically permanent.

    • So currently with this MR there will be a produced pdf (without the error message) and the slurm out files (error message and logs in slurm***.out). How are you planning to address this later as you mentioned?

    • I put my voice with (1)

    Just revert the change as a hotfix if needed: add --on-error-resume-next back to the script and deploy pycalibration again.

    in case a lot of unexpected errors occurred. We should aim to fix the errors before reverting this.

    Edited by Karim Ahmed
  • You're right, the notebook which errors, and therefore that section of the PDF report, won't contain any output. I thought it would be saved with the output up to that point, but I had remembered that wrong. Maybe we need to fix that in princess first.

  • Thomas Kluyver added 1 commit

    added 1 commit

    • 900df369 - Use princess 0.3 to save notebook after error

    Compare with previous version

  • I have merged the change in princess, and updated the version we use, so the notebooks should still be saved after an error, and therefore the output should appear in the PDF report.

  • Thomas Kluyver added 36 commits

    added 36 commits

    Compare with previous version

  • Thanks, LGTM.

    So what is the plan now we go with which point? and It would be good to have it merged before next deployment so we can test it along with other developed MRs.

    Edited by Karim Ahmed
  • I think we're OK to just revert the change if needed - no-one seems to have really argued that that's insufficient. But I'd like at least @schmidtp to OK that before I merge it.

  • I like it, let's do it.

  • Thomas Kluyver mentioned in commit 06f3c810

    mentioned in commit 06f3c810

  • Thomas Kluyver mentioned in merge request !525 (merged)

    mentioned in merge request !525 (merged)

  • Philipp Schmidt changed milestone to %3.4.1

    changed milestone to %3.4.1

Please register or sign in to reply
Loading