[Webservice] Don't mark jobs as finished just because they disappear from squeue output

Thomas Kluyver requested to merge fix/jobmon-slurm-finish-states into master

Description

We discovered that some calibrations appeared to have failed in myMdC even though they had succeeded. When we looked in the calibration webservice DB, we found jobs marked as finished but with status 'RUNNING' or 'PENDING'. Because failure is reported whenever a finished job has a status other than COMPLETED, these jobs were reported as failed.

My best guess is that jobs which are pre-empted and requeued briefly disappear from the output of squeue. The docs for squeue say that "If no state is specified then pending, running, and completing jobs are reported," which appears to exclude the REQUEUED state. Then when we look up these jobs with sacct, they are PENDING or RUNNING again.

This MR instead decides whether a job is finished based on its Slurm state. I've collected all the states which I think should be final; this list is already used in sfollow. I listed the final states rather than the non-final ones because I think the list of non-final states would be slightly longer.
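For illustration, a state-based check along these lines might look like the sketch below. This is not the code in this MR: the names `FINAL_STATES`, `normalise_state` and `is_finished` are hypothetical, and the set of states is an assumed selection of terminal Slurm job states that may differ from the list shared with sfollow. It also assumes state strings can carry a suffix, as in sacct's `CANCELLED by <uid>`.

```python
# Assumed, illustrative set of terminal Slurm job states.
# The actual list used by the webservice / sfollow may differ.
FINAL_STATES = frozenset({
    "BOOT_FAIL", "CANCELLED", "COMPLETED", "DEADLINE", "FAILED",
    "NODE_FAIL", "OUT_OF_MEMORY", "TIMEOUT",
})


def normalise_state(raw_state: str) -> str:
    """Reduce a state string such as 'CANCELLED by 1234' to 'CANCELLED'."""
    parts = raw_state.split()
    if not parts:
        return ""
    # A trailing '+' can appear when the state column is truncated.
    return parts[0].rstrip("+").upper()


def is_finished(raw_state: str) -> bool:
    """A job counts as finished only if its state is known to be final.

    Anything else (PENDING, RUNNING, REQUEUED, or no state at all because
    the job was missing from the query output) is treated as still in progress.
    """
    return normalise_state(raw_state) in FINAL_STATES
```

With this approach, a job that is briefly absent from the squeue output is simply left unfinished and checked again on the next pass.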

We discussed whether there should be a timeout after which we stop checking jobs. I haven't implemented this yet because it's extra complexity and I'm not sure of the details - e.g. do we start timing only once jobs leave PENDING state?
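To make the open question concrete, here is a rough sketch of how the two timing options could be expressed. Nothing like this is implemented in this MR; the function `check_timed_out` and all of its parameters are hypothetical, and the webservice may not record these timestamps in this form.

```python
from datetime import datetime, timedelta
from typing import Optional


def check_timed_out(
    submitted: datetime,
    left_pending: Optional[datetime],
    now: datetime,
    limit: timedelta,
    count_pending_time: bool,
) -> bool:
    """Return True if we should give up monitoring a job.

    count_pending_time=True:  the clock starts at submission.
    count_pending_time=False: the clock starts when the job first leaves
                              PENDING; a job that is still pending
                              (left_pending is None) never times out.
    """
    start = submitted if count_pending_time else left_pending
    if start is None:
        return False
    return now - start > limit
```

The second option avoids abandoning jobs that sit in a long queue, at the cost of never giving up on a job that stays PENDING indefinitely.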

How Has This Been Tested?

TBD

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • My code follows the code style of this project.

Reviewers

@ahmedk @schmidtp
