Stop execution on notebook errors
Description
See calibration/planning#34. Trying to carry on executing a notebook after an uncaught exception rarely works, because the later code often depends on things which didn't happen (variables not defined, files not written, etc.). This stops notebook execution on uncaught errors, as is the norm for Python (and, I think, most languages with exceptions). This should make for clearer output when things do go wrong (because there aren't a load of meaningless extra errors after the real problem).
Slurm still shouldn't treat the jobs as failing, because bash does essentially the same thing by default: unless you use set -e
, it will ignore failing commands and carry on to the next one. So reports should still be generated and files moved as expected. I plan to address that in a later step.
How Has This Been Tested?
Running xfel-calibrate
at the command line. Output is in /gpfs/exfel/data/scratch/kluyvert/agipd-calib-900201-203-errstop
.
Types of changes
- Bug fix
- Breaking change (potentially, if there are cases which rely on subsequent cells executing after an error)
Checklist:
- My code follows the code style of this project.
Reviewers
Merge request reports
Activity
So we can neatly test with this patch on Wednesday, but I'm wondering whether it makes sense to have it as a toggle-able option in the configuration. We don't have a proper way to set global defaults right now, but this might still allow us to change the behaviour quickly during operations. Alternatively, we might simply revert this change in these cases.
I agree that it's probably useful to be able to undo the change quickly if it causes a problem. Three options in order of increasing technical complexity:
- Just revert the change as a hotfix if needed: add
--on-error-resume-next
back to the script and deploy pycalibration again. - Add an option like
xfel-calibrate --ignore-nb-errors
, which could be added to webservice config to go back globally. Change the YAML file, restart the webservice. - Allow controlling it per instrument/proposal/detector via calibration_configurations. Use
update_config.py
to flip the switch as needed.
I'm leaning towards 1. I think the main reason to add configuration options would be as a signal that we're serious about not breaking things - but so long as we can fix any issues quickly, I'm not sure anyone cares about how we do it. And if we add a config option, there's a risk that it becomes basically permanent.
- Just revert the change as a hotfix if needed: add
-
So currently with this MR there will be a produced pdf (without the error message) and the slurm out files (error message and logs in slurm***.out). How are you planning to address this later as you mentioned?
-
I put my voice with (1)
Just revert the change as a hotfix if needed: add
--on-error-resume-next
back to the script and deploy pycalibration again.in case a lot of unexpected errors occurred. We should aim to fix the errors before reverting this.
Edited by Karim Ahmed-
added 1 commit
- 900df369 - Use princess 0.3 to save notebook after error
added 36 commits
-
900df369...7acf1828 - 34 commits from branch
master
- 03046a38 - Stop execution on notebook errors
- b4049d62 - Use princess 0.3 to save notebook after error
-
900df369...7acf1828 - 34 commits from branch
Thanks, LGTM.
So what is the plan now we go with which point? and It would be good to have it merged before next deployment so we can test it along with other developed MRs.
Edited by Karim AhmedI think we're OK to just revert the change if needed - no-one seems to have really argued that that's insufficient. But I'd like at least @schmidtp to OK that before I merge it.
mentioned in commit 06f3c810
mentioned in merge request !525 (merged)
changed milestone to %3.4.1