[Webservice] Don't mark jobs as finished just because they disappear from squeue output

Merged: Thomas Kluyver requested to merge fix/jobmon-slurm-finish-states into master
2 unresolved threads
1 file  +20  −7
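In outline, the fix stops treating "job no longer listed by squeue" as proof of completion: squeue is run with --states=all so recently finished jobs stay visible, and a job is only marked finished once the state reported for it is in an explicit set of terminal states. A minimal sketch of that logic (simplified for illustration; the actual change is in the diff below):

    # Simplified sketch of the new logic (illustration only, not the webservice code)
    STATES_FINISHED = {  # https://slurm.schedmd.com/squeue.html#lbAG
        'BOOT_FAIL', 'CANCELLED', 'COMPLETED', 'DEADLINE', 'FAILED',
        'OUT_OF_MEMORY', 'SPECIAL_EXIT', 'TIMEOUT',
    }

    def is_finished(slurm_state):
        # Old behaviour: finished = job missing from squeue output.
        # New behaviour: finished = the reported state is a terminal one.
        return slurm_state in STATES_FINISHED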
@@ -23,6 +23,11 @@ except ImportError:
 log = logging.getLogger(__name__)
 
+STATES_FINISHED = {  # https://slurm.schedmd.com/squeue.html#lbAG
+    'BOOT_FAIL', 'CANCELLED', 'COMPLETED', 'DEADLINE', 'FAILED',
+    'OUT_OF_MEMORY', 'SPECIAL_EXIT', 'TIMEOUT',
+}
+
 
 class NoOpProducer:
     """Fills in for Kafka producer object when setting that up fails"""
@@ -50,10 +55,10 @@ def slurm_status(filter_user=True):
     :return: a dictionary indexed by slurm jobid and containing a tuple
     of (status, run time) as values.
     """
-    cmd = ["squeue"]
+    cmd = ["squeue", "--states=all"]
     if filter_user:
         cmd += ["--me"]
-    res = run(cmd, stdout=PIPE)
+    res = run(cmd, stdout=PIPE, stderr=PIPE)
     if res.returncode == 0:
         rlines = res.stdout.decode().split("\n")
         statii = {}
@@ -65,6 +70,10 @@ def slurm_status(filter_user=True):
             except ValueError:  # not enough values to unpack in split
                 pass
         return statii
+    else:
+        log.error("Running squeue failed. stdout: %r, stderr: %r",
+                  res.stdout.decode(), res.stderr.decode())
+        return None
 
 
 def slurm_job_status(jobid):
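For orientation, the docstring above implies slurm_status() returns a mapping from job ID to a (state, run time) tuple, or None when squeue itself fails. An illustrative example of that shape (made-up values, not part of the diff):

    # Illustrative only: plausible slurm_status() output per the docstring above
    # (made-up job IDs and times); None would mean running squeue failed.
    statii = {
        "1048849": ("RUNNING", "1:23:45"),
        "1048850": ("COMPLETED", "0:12:03"),  # still listed thanks to --states=all
    }
    for jobid, (state, runtime) in statii.items():
        finished = state in STATES_FINISHED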
@@ -148,15 +157,19 @@ class JobsMonitor:
         Newly completed executions are present with an empty list.
         """
+        jobs_to_check = self.job_db.execute(
+            "SELECT job_id, exec_id FROM slurm_jobs WHERE finished = 0"
+        ).fetchall()
+        if not jobs_to_check:
+            log.debug("No unfinished jobs to check")
+            return {}
+
    • Comment on lines +163 to +165

      While I was touching this code, I rearranged it so we don't run squeue at all when there are no unfinished jobs in the database.

 
         statii = slurm_status()
         # Check that slurm is giving proper feedback
         if statii is None:
             return {}
         log.debug(f"SLURM info {statii}")
 
-        jobs_to_check = self.job_db.execute(
-            "SELECT job_id, exec_id FROM slurm_jobs WHERE finished = 0"
-        ).fetchall()
-
         ongoing_jobs_by_exn = {}
         updates = []
         for r in jobs_to_check:
@@ -166,13 +179,13 @@ class JobsMonitor:
             if str(r['job_id']) in statii:
                 # statii contains jobs which are still going (from squeue)
                 slstatus, runtime = statii[str(r['job_id'])]
-                finished = False
                 execn_ongoing_jobs.append(f"{slstatus}-{runtime}")
             else:
                 # These jobs have finished (successfully or otherwise)
                 _, runtime, slstatus = slurm_job_status(r['job_id'])
-                finished = True
+
+            finished = slstatus in STATES_FINISHED
    • You raised the concern that we may have missed some finished state in the set. Would it be more robust to check for "non-completion" instead, i.e. for states that mean the job is still running or waiting?

      • I think the list of states which mean "still going or waiting" in various ways is slightly longer than the list of finished states, so there's probably a similar chance of mistakenly including something in that list as there is of leaving something out of this one. :shrug: Though I guess doing it that way would fail quickly and obviously if a new version of Slurm adds further states to the list. (A sketch of that inverted check follows the examples below.)

        Some examples of statuses which I'm unsure about:

        • "RESV_DEL_HOLD: Job is being held after requested reservation was deleted." - in theory I guess the job could be modified and then released to run, but in practice we'd probably start a new job instead.
        • "REVOKED: Sibling was removed from cluster due to other cluster starting the job." - we don't use the multi-cluster features, and I don't know if squeue shows all clusters or just the one you're on.
        • "SPECIAL_EXIT: The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value." - Exit implies it's finished, but requeued implies it will run again.
      • Using both a start and end date, it appears you're able to query whether any given state ever occurred (a scripted version of the same check is sketched after the output below), e.g.:

        [xcal@max-exfl016 current]$ sacct -S now-10000day -E now -u xcal -s NODE_FAIL
        JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
        ------------ ---------- ---------- ---------- ---------- ---------- -------- 
        19180        correct_M+      exfel      exfel         72  NODE_FAIL      0:0 
        19180.batch       batch                 exfel         72  CANCELLED          
        1048849      correct_S+ upex-midd+      exfel         72  NODE_FAIL      0:0 
        1048849.bat+      batch                 exfel         72  CANCELLED          
        1048850      correct_S+ upex-midd+      exfel         72  NODE_FAIL      0:0 
        1048850.bat+      batch                 exfel         72  CANCELLED          
        1048863      correct_S+ upex-midd+      exfel         72  NODE_FAIL      0:0 
        1048863.bat+      batch                 exfel         72  CANCELLED          
        1048960      correct_S+ upex-midd+      exfel         72  NODE_FAIL      0:0 
        1048960.bat+      batch                 exfel         72  CANCELLED          
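
        If that check were ever scripted, it might look roughly like this (a sketch only, using standard sacct options; not part of this MR):

        from subprocess import PIPE, run

        def states_ever_seen(jobid):
            # Sketch only: ask the accounting database which states it has
            # recorded for this job (any step). Not part of this MR.
            res = run(
                ["sacct", "-j", str(jobid), "--format=State",
                 "--parsable2", "--noheader"],
                stdout=PIPE, stderr=PIPE,
            )
            if res.returncode != 0:
                return None  # accounting info unavailable
            # Cancelled entries read e.g. "CANCELLED by 12345", so keep the first word.
            return {line.split()[0]
                    for line in res.stdout.decode().splitlines() if line.strip()}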
      • Hmmm, good point. According to sacct, no xcal jobs have been REQUEUED, and a couple of specific ones I checked where we saw this kind of issue (1020360 & 1026886) have never been in any state other than PENDING, RUNNING or COMPLETED. :confused:

        However, some other jobs, including recent ones (latest 1026802 on 2023-04-01), have been PREEMPTED but are now COMPLETED. I thought PREEMPTED was a final state for jobs that couldn't be requeued, but I guess I was wrong about that.

        So I'm back to not really knowing what's going on. Maybe the accounting database sacct uses is sometimes slightly outdated, so squeue knows the job has finished but sacct doesn't. If that's the case, this would fix it, but I'm not convinced.

             updates.append((finished, runtime, slstatus, r['job_id']))