[Webservice] Don't mark jobs as finished just because they disappear from squeue output

Merged: Thomas Kluyver requested to merge fix/jobmon-slurm-finish-states into master
2 unresolved threads
1 file  +20  −7
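In outline, the fix stops treating "job no longer listed by squeue" as proof of completion: squeue is run with --states=all so recently finished jobs stay visible, and a job is only marked finished once the state reported for it is in an explicit set of terminal states. A minimal sketch of that logic (simplified for illustration; the actual change is in the diff below):

    # Simplified sketch of the new logic (illustration only, not the webservice code)
    STATES_FINISHED = {  # https://slurm.schedmd.com/squeue.html#lbAG
        'BOOT_FAIL', 'CANCELLED', 'COMPLETED', 'DEADLINE', 'FAILED',
        'OUT_OF_MEMORY', 'SPECIAL_EXIT', 'TIMEOUT',
    }

    def is_finished(slurm_state):
        # Old behaviour: finished = job missing from squeue output.
        # New behaviour: finished = the reported state is a terminal one.
        return slurm_state in STATES_FINISHED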
@@ -23,6 +23,11 @@ except ImportError:
 log = logging.getLogger(__name__)
 
+STATES_FINISHED = {  # https://slurm.schedmd.com/squeue.html#lbAG
+    'BOOT_FAIL', 'CANCELLED', 'COMPLETED', 'DEADLINE', 'FAILED',
+    'OUT_OF_MEMORY', 'SPECIAL_EXIT', 'TIMEOUT',
+}
+
 
 class NoOpProducer:
     """Fills in for Kafka producer object when setting that up fails"""
@@ -50,10 +55,10 @@ def slurm_status(filter_user=True):
     :return: a dictionary indexed by slurm jobid and containing a tuple
     of (status, run time) as values.
     """
-    cmd = ["squeue"]
+    cmd = ["squeue", "--states=all"]
     if filter_user:
         cmd += ["--me"]
-    res = run(cmd, stdout=PIPE)
+    res = run(cmd, stdout=PIPE, stderr=PIPE)
     if res.returncode == 0:
         rlines = res.stdout.decode().split("\n")
         statii = {}
@@ -65,6 +70,10 @@ def slurm_status(filter_user=True):
             except ValueError:  # not enough values to unpack in split
                 pass
         return statii
+    else:
+        log.error("Running squeue failed. stdout: %r, stderr: %r",
+                  res.stdout.decode(), res.stderr.decode())
+        return None
 
 
 def slurm_job_status(jobid):
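For orientation, the docstring above implies slurm_status() returns a mapping from job ID to a (state, run time) tuple, or None when squeue itself fails. An illustrative example of that shape (made-up values, not part of the diff):

    # Illustrative only: plausible slurm_status() output per the docstring above
    # (made-up job IDs and times); None would mean running squeue failed.
    statii = {
        "1048849": ("RUNNING", "1:23:45"),
        "1048850": ("COMPLETED", "0:12:03"),  # still listed thanks to --states=all
    }
    for jobid, (state, runtime) in statii.items():
        finished = state in STATES_FINISHED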
@@ -148,15 +157,19 @@ class JobsMonitor:
         Newly completed executions are present with an empty list.
         """
+        jobs_to_check = self.job_db.execute(
+            "SELECT job_id, exec_id FROM slurm_jobs WHERE finished = 0"
+        ).fetchall()
+        if not jobs_to_check:
+            log.debug("No unfinished jobs to check")
+            return {}
+
    • Comment on lines +163 to +165

      While I was touching this code, I rearranged it so we don't run squeue at all when there are no unfinished jobs in the database.

 
         statii = slurm_status()
         # Check that slurm is giving proper feedback
         if statii is None:
             return {}
         log.debug(f"SLURM info {statii}")
 
-        jobs_to_check = self.job_db.execute(
-            "SELECT job_id, exec_id FROM slurm_jobs WHERE finished = 0"
-        ).fetchall()
-
         ongoing_jobs_by_exn = {}
         updates = []
         for r in jobs_to_check:
@@ -166,13 +179,13 @@ class JobsMonitor:
             if str(r['job_id']) in statii:
                 # statii contains jobs which are still going (from squeue)
                 slstatus, runtime = statii[str(r['job_id'])]
-                finished = False
                 execn_ongoing_jobs.append(f"{slstatus}-{runtime}")
             else:
                 # These jobs have finished (successfully or otherwise)
                 _, runtime, slstatus = slurm_job_status(r['job_id'])
-                finished = True
+
+            finished = slstatus in STATES_FINISHED
    • You raised the concern that we may have missed some finished state in the set. Would it be more robust to check for "non-completion" instead, i.e. for states that mean the job is still running or waiting?

      • I think the list of states which mean "still going or waiting" in various ways is slightly longer than the list of finished states, so there's probably a similar chance of mistakenly including something in that list as there is of leaving something out of this one. :shrug: Though I guess doing it that way would fail quickly and obviously if a new version of Slurm adds further states to the list. (A sketch of that inverted check follows the examples below.)

        Some examples of statuses which I'm unsure about:

        • "RESV_DEL_HOLD: Job is being held after requested reservation was deleted." - in theory I guess the job could be modified and then released to run, but in practice we'd probably start a new job instead.
        • "REVOKED: Sibling was removed from cluster due to other cluster starting the job." - we don't use the multi-cluster features, and I don't know if squeue shows all clusters or just the one you're on.
        • "SPECIAL_EXIT: The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value." - Exit implies it's finished, but requeued implies it will run again.
      • Using both a start and end date, it appears you're able to query whether any given state ever occurred (a scripted version of the same check is sketched after the output below), e.g.:

        [xcal@max-exfl016 current]$ sacct -S now-10000day -E now -u xcal -s NODE_FAIL
        JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
        ------------ ---------- ---------- ---------- ---------- ---------- -------- 
        19180        correct_M+      exfel      exfel         72  NODE_FAIL      0:0 
        19180.batch       batch                 exfel         72  CANCELLED          
        1048849      correct_S+ upex-midd+      exfel         72  NODE_FAIL      0:0 
        1048849.bat+      batch                 exfel         72  CANCELLED          
        1048850      correct_S+ upex-midd+      exfel         72  NODE_FAIL      0:0 
        1048850.bat+      batch                 exfel         72  CANCELLED          
        1048863      correct_S+ upex-midd+      exfel         72  NODE_FAIL      0:0 
        1048863.bat+      batch                 exfel         72  CANCELLED          
        1048960      correct_S+ upex-midd+      exfel         72  NODE_FAIL      0:0 
        1048960.bat+      batch                 exfel         72  CANCELLED          
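
        If that check were ever scripted, it might look roughly like this (a sketch only, using standard sacct options; not part of this MR):

        from subprocess import PIPE, run

        def states_ever_seen(jobid):
            # Sketch only: ask the accounting database which states it has
            # recorded for this job (any step). Not part of this MR.
            res = run(
                ["sacct", "-j", str(jobid), "--format=State",
                 "--parsable2", "--noheader"],
                stdout=PIPE, stderr=PIPE,
            )
            if res.returncode != 0:
                return None  # accounting info unavailable
            # Cancelled entries read e.g. "CANCELLED by 12345", so keep the first word.
            return {line.split()[0]
                    for line in res.stdout.decode().splitlines() if line.strip()}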
      • Hmmm, good point. According to sacct, no xcal jobs have been REQUEUED, and a couple of specific ones I checked where we saw this kind of issue (1020360 & 1026886) have never been in any state other than PENDING, RUNNING or COMPLETED. :confused:

        However, some other jobs, including recent ones (latest 1026802 on 2023-04-01), have been PREEMPTED but are now COMPLETED. I thought PREEMPTED was a final state for jobs that couldn't be requeued, but I guess I was wrong about that.

        So I'm back to not really knowing what's going on. Maybe the accounting database sacct uses is sometimes slightly outdated, so squeue knows the job has finished but sacct doesn't. If that's the case, this would fix it, but I'm not convinced.

             updates.append((finished, runtime, slstatus, r['job_id']))