[Webservice] Monitor Slurm jobs in separate process
Description
Now that we have a process supervisor for the webservice & serve_overview process, it makes sense for this to be a separate process as well, rather than a thread in the webservice process (which was convenient when we launched it manually). The diff here is mostly just moving code that already exists into a separate file for clarity.
This means its logs will be visible separately, and the supervisor can restart the job monitor if it fails. The overview server (http://max-exfl016.desy.de:8008/) will no longer show log messages from job monitoring, which it currently does. We could add them as a separate block if needed, or use caldeploy logs to look at them.
A related change will be needed in the deployment tools.
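As a rough illustration of the idea described above (not the code in this merge request), a monitor that runs as its own process might look like the sketch below. It logs through its own logger and keeps going after a failed check, while an unexpected crash of the process itself is left to the supervisor to handle by restarting it. The names `monitor_loop`, `check_jobs`, `interval` and `max_iterations`, and the logger name, are hypothetical and chosen only for this example.

```python
import logging
import time

# Hypothetical logger name; a separate process gets its own log stream.
log = logging.getLogger("job_monitor")


def monitor_loop(check_jobs, interval=30.0, max_iterations=None):
    """Call check_jobs() repeatedly, logging failures instead of stopping.

    check_jobs is any callable (an assumption for this sketch).
    max_iterations exists only so the loop can be exercised in a test;
    None means run until the process is stopped.
    Returns the number of iterations performed.
    """
    done = 0
    while max_iterations is None or done < max_iterations:
        try:
            check_jobs()
        except Exception:
            # A failed check is logged with its traceback; the loop continues.
            log.error("Job check failed", exc_info=True)
        done += 1
        # Sleep between checks, but not after the final one.
        if max_iterations is None or done < max_iterations:
            time.sleep(interval)
    return done


def main():
    # Entry point a process supervisor could launch (hypothetical).
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    monitor_loop(lambda: log.info("Checking jobs..."))
```

In this sketch nothing calls `main()` automatically; a real entry point would call it under an `if __name__ == "__main__":` guard.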
How Has This Been Tested?
Run on max-exfl017, see comment below.
Types of changes
- Refactor (refactoring code with no functionality changes)
Checklist:
- My code follows the code style of this project.
Reviewers
Activity
https://git.xfel.eu/calibration/deployment-tools/-/merge_requests/4 is the corresponding change to the deployment tools.
- webservice/job_monitor.py 0 → 100644
    50         of (status, run time) as values.
    51     """
    52     cmd = ["squeue"]
    53     if filter_user:
    54         cmd += ["--me"]
    55     res = run(cmd, stdout=PIPE)
    56     if res.returncode == 0:
    57         rlines = res.stdout.decode().split("\n")
    58         statii = {}
    59         for r in rlines[1:]:
    60             try:
    61                 jobid, _, _, _, status, runtime, _, _ = r.split()
    62                 jobid = jobid.strip()
    63                 statii[jobid] = status, runtime
    64             except ValueError:  # not enough values to unpack in split
    65                 pass
- Comment on lines +60 to +65
    60 -             try:
    61 -                 jobid, _, _, _, status, runtime, _, _ = r.split()
    62 -                 jobid = jobid.strip()
    63 -                 statii[jobid] = status, runtime
    64 -             except ValueError:  # not enough values to unpack in split
    65 -                 pass
    60 +             # Ignore errors if there are not enough values to unpack in split
    61 +             with contextlib.suppress(ValueError):
    62 +                 jobid, _, _, _, status, runtime, _, _ = r.split()
    63 +                 jobid = jobid.strip()
    64 +                 statii[jobid] = status, runtime
Requires import contextlib
Minor suggestion since it's a bit neater than try/except/pass imo
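For reference, the suggested pattern can be tried in isolation with a small sketch like the one below. It assumes the default `squeue` layout of eight whitespace-separated columns (JOBID, PARTITION, NAME, USER, ST, TIME, NODES, NODELIST(REASON)), which is what the unpacking in the diff implies. The function name `parse_squeue_output` and the sample rows are made up for illustration. Note that `contextlib.suppress(ValueError)` skips rows with too many fields as well as too few, exactly as the original try/except does.

```python
import contextlib


def parse_squeue_output(text):
    """Map job ID -> (status, run time) from default-format squeue output."""
    statii = {}
    # Skip the header row, then parse one job per line.
    for line in text.split("\n")[1:]:
        # Lines that do not split into exactly 8 fields (e.g. the trailing
        # empty line) raise ValueError on unpacking and are silently skipped.
        with contextlib.suppress(ValueError):
            jobid, _, _, _, status, runtime, _, _ = line.split()
            statii[jobid] = status, runtime
    return statii


# Illustrative sample only; not real output from any cluster.
SAMPLE = (
    "JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)\n"
    "1234 exfel correct xcal R 1:23 1 max-exfl100\n"
    "1235 exfel correct xcal PD 0:00 1 (Priority)\n"
    "incomplete row\n"
    "\n"
)

print(parse_squeue_output(SAMPLE))
```

`str.split()` with no arguments already discards surrounding whitespace, so the extra `jobid.strip()` from the diff is left out of this sketch.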
- Resolved by Thomas Kluyver
I didn't have a specific log in mind. I think all logs are useful :). I see that there are a few error logs: connection to Kafka, connection to the job DB, and connection to myMDC. I think in principle, except for the Kafka connection, the webservice logs would also show any errors while connecting to the job DB and myMDC.
So maybe for now it is not very useful, but if we get to add more features for the job monitoring it might be useful.
I know that many in DOC checked the webpage for webservice errors instead of using caldeploy logs or opening the web.log file.
Edited by Karim Ahmed
I've added the job monitor logs to the overview page; I've currently got this branch deployed on max-exfl017 for testing if you want to look at http://max-exfl017.desy.de:8008/ .
Here's a screenshot of the overview page with this branch:
Edited by Thomas Kluyver
mentioned in commit e17cb7d4
mentioned in merge request !543 (closed)
mentioned in merge request !683 (merged)
changed milestone to %3.6.0