[Webservice] Don't mark jobs as finished just because they disappear from squeue output
I think the list of states which mean still going/waiting in various ways is slightly longer than those which mean finished, so there's probably a similar chance of mistakenly including something in that list as there is of missing it out of this one.
Though I guess doing it that way would fail quickly and obviously if a new version of Slurm adds further states to the list.
Some examples of statuses which I'm unsure about:
- "RESV_DEL_HOLD: Job is being held after requested reservation was deleted." - in theory I guess the job could be modified and then released to run, but in practice we'd probably start a new job instead.
- "REVOKED: Sibling was removed from cluster due to other cluster starting the job." - we don't use the multi-cluster features, and I don't know if squeue shows all clusters or just the one you're on.
- "SPECIAL_EXIT: The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value." - Exit implies it's finished, but requeued implies it will run again.
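To make the trade-off concrete, here is a rough sketch of classifying states explicitly rather than treating "not in one list" as belonging to the other. The state names come from the Slurm job state codes, but the grouping is my assumption, and `Category` and `classify` are hypothetical names, not anything in the webservice. The three states above go into a separate "unsure" set, along with PREEMPTED, which is also an assumption on my part. Anything unrecognised is reported as unknown instead of being silently counted as finished or running.

```python
from enum import Enum


class Category(Enum):
    FINISHED = "finished"
    ACTIVE = "active"
    UNSURE = "unsure"
    UNKNOWN = "unknown"


# Assumed grouping of Slurm job states; adjust to taste.
FINISHED_STATES = frozenset({
    "BOOT_FAIL", "CANCELLED", "COMPLETED", "DEADLINE",
    "FAILED", "NODE_FAIL", "OUT_OF_MEMORY", "TIMEOUT",
})
ACTIVE_STATES = frozenset({
    "COMPLETING", "CONFIGURING", "PENDING", "REQUEUE_FED",
    "REQUEUE_HOLD", "REQUEUED", "RESIZING", "RUNNING",
    "SIGNALING", "STAGE_OUT", "STOPPED", "SUSPENDED",
})
# States whose meaning for us is unclear (assumption: PREEMPTED belongs here too).
UNSURE_STATES = frozenset({
    "PREEMPTED", "RESV_DEL_HOLD", "REVOKED", "SPECIAL_EXIT",
})


def classify(state: str) -> Category:
    """Classify a state string such as 'RUNNING' or 'CANCELLED by 1234'."""
    words = state.split()
    # Only the first word is the state name; sacct may append e.g. 'by <uid>'.
    name = words[0].upper() if words else ""
    if name in FINISHED_STATES:
        return Category.FINISHED
    if name in ACTIVE_STATES:
        return Category.ACTIVE
    if name in UNSURE_STATES:
        return Category.UNSURE
    # A state added by a newer Slurm version ends up here and can be logged.
    return Category.UNKNOWN
```

With this shape, a new Slurm state shows up as `Category.UNKNOWN` and the caller decides whether to log, retry, or fail.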
Using both a start and an end date, it appears you can query whether any job has ever been in a given state, e.g.:
    [xcal@max-exfl016 current]$ sacct -S now-10000day -E now -u xcal -s NODE_FAIL
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    19180        correct_M+      exfel      exfel         72  NODE_FAIL      0:0
    19180.batch       batch                 exfel         72  CANCELLED
    1048849      correct_S+ upex-midd+      exfel         72  NODE_FAIL      0:0
    1048849.bat+      batch                 exfel         72  CANCELLED
    1048850      correct_S+ upex-midd+      exfel         72  NODE_FAIL      0:0
    1048850.bat+      batch                 exfel         72  CANCELLED
    1048863      correct_S+ upex-midd+      exfel         72  NODE_FAIL      0:0
    1048863.bat+      batch                 exfel         72  CANCELLED
    1048960      correct_S+ upex-midd+      exfel         72  NODE_FAIL      0:0
    1048960.bat+      batch                 exfel         72  CANCELLED
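If we wanted to run this kind of check from a script, something like the following might work. It is only a sketch: `sacct_command`, `parse_sacct_output` and `jobs_ever_in_state` are hypothetical helpers, and I'm assuming we only care about the job ID and state columns. It uses the same `-S`/`-E`/`-u`/`-s` query as above, plus sacct's `--format`, `--parsable2` and `--noheader` options to get `|`-separated output without truncated columns.

```python
import subprocess


def sacct_command(user, state, start="now-10000day", end="now"):
    """Build the sacct query shown above, with machine-readable output."""
    return [
        "sacct", "-S", start, "-E", end, "-u", user, "-s", state,
        "--format=JobID,State", "--parsable2", "--noheader",
    ]


def parse_sacct_output(text):
    """Map job ID -> state, skipping job steps such as '19180.batch'."""
    jobs = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        job_id, _, state = line.partition("|")
        if "." in job_id:
            # Steps (e.g. .batch) report their own state; keep the job's.
            continue
        jobs[job_id] = state
    return jobs


def jobs_ever_in_state(user, state):
    """Run sacct and return the matching jobs (needs a Slurm login node)."""
    result = subprocess.run(
        sacct_command(user, state),
        capture_output=True, text=True, check=True,
    )
    return parse_sacct_output(result.stdout)
```

For example, `jobs_ever_in_state("xcal", "REQUEUED")` would return an empty dict if no job had been requeued in that time window.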
Hmmm, good point. According to sacct, no xcal jobs have been REQUEUED, and a couple of specific ones I checked where we saw this kind of issue (1020360 & 1026886) have never been in any state other than PENDING, RUNNING or COMPLETED.
However, some other jobs, including recent ones (latest 1026802 on 2023-04-01), have been PREEMPTED but are now COMPLETED. I thought PREEMPTED was a final state for jobs that couldn't be requeued, but I guess I was wrong about that.
So I'm back to not really knowing what's going on. Maybe the accounting database sacct uses is sometimes slightly outdated, so `squeue` knows the job has finished but `sacct` doesn't. If that's the case, this would fix it, but I'm not convinced.
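For the sake of discussion, this is roughly the behaviour I have in mind, as a sketch rather than the actual code in this change. All names here (`resolve_job_state`, `sacct_lookup`, `is_finished`) are hypothetical, and the lookups are passed in as callables so the example doesn't assume how the webservice talks to Slurm. The point is that a job missing from squeue is only marked finished when sacct positively reports a finished state; otherwise we leave it alone and check again on the next pass.

```python
from typing import Callable, Mapping, Optional


def resolve_job_state(
    job_id: str,
    squeue_states: Mapping[str, str],
    sacct_lookup: Callable[[str], Optional[str]],
    is_finished: Callable[[str], bool],
) -> Optional[str]:
    """Return the state to record for a job, or None to leave it unchanged."""
    if job_id in squeue_states:
        # squeue still lists the job, so trust what it says.
        return squeue_states[job_id]
    # The job has disappeared from squeue: ask the accounting database.
    reported = sacct_lookup(job_id)
    if reported is not None and is_finished(reported):
        return reported
    # sacct has no record yet, or still shows a non-final state (possibly
    # stale). Don't mark the job finished; try again on the next pass.
    return None
```

If the accounting database really does lag behind, the job would simply stay unfinished in our database for one or two more polling cycles.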
While I was touching this code, I rearranged it so we don't run `squeue` at all when there are no unfinished jobs in the database.
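In outline, the rearrangement amounts to an early return like the one below. This is a simplified illustration, not the real function: `refresh_states` and `query_squeue` are made-up names, and the squeue call is injected so the sketch stays independent of how it is actually invoked.

```python
def refresh_states(unfinished_ids, query_squeue):
    """Return {job_id: state} for unfinished jobs, skipping squeue if none."""
    ids = list(unfinished_ids)
    if not ids:
        # Nothing to check, so don't spawn squeue at all.
        return {}
    return query_squeue(ids)
```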