Skip to content
Snippets Groups Projects

[Webservice] Monitor Slurm jobs in separate process

Merged Thomas Kluyver requested to merge separate-job-monitor into master
2 unresolved threads

Description

Now that we have a process supervisor for the webservice & serve_overview process, it makes sense for this to be a separate process as well, rather than a thread in the webservice process (which was convenient when we launched it manually). The diff here is mostly just moving code that already exists into a separate file for clarity.

This means its logs will be visible separately, and the supervisor can restart the job monitor if it fails. The overview server (http://max-exfl016.desy.de:8008/ ) will no longer show log messages from job monitoring, which it currently does. We could add them as a separate block if needed, or use caldeploy logs to look at them.

A related change will be needed in the deployment tools.

How Has This Been Tested?

Run on max-exfl017, see comment below.

Types of changes

  • Refactor (refactoring code with no functionality changes)

Checklist:

  • My code follows the code style of this project.

Reviewers

@schmidtp @roscar

Edited by Thomas Kluyver

Merge request reports

Checking pipeline status.

Approved by

Merged by Thomas KluyverThomas Kluyver 2 years ago (May 19, 2022 4:28pm UTC)

Merge details

  • Changes merged into master with e17cb7d4.
  • Deleted the source branch.

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
50 of (status, run time) as values.
51 """
52 cmd = ["squeue"]
53 if filter_user:
54 cmd += ["--me"]
55 res = run(cmd, stdout=PIPE)
56 if res.returncode == 0:
57 rlines = res.stdout.decode().split("\n")
58 statii = {}
59 for r in rlines[1:]:
60 try:
61 jobid, _, _, _, status, runtime, _, _ = r.split()
62 jobid = jobid.strip()
63 statii[jobid] = status, runtime
64 except ValueError: # not enough values to unpack in split
65 pass
  • Comment on lines +60 to +65
    Suggested change
    60 try:
    61 jobid, _, _, _, status, runtime, _, _ = r.split()
    62 jobid = jobid.strip()
    63 statii[jobid] = status, runtime
    64 except ValueError: # not enough values to unpack in split
    65 pass
    60 # Ignore errors if there are not enough values to unpack in split
    61 with contextlib.suppress(ValueError):
    62 jobid, _, _, _, status, runtime, _, _ = r.split()
    63 jobid = jobid.strip()
    64 statii[jobid] = status, runtime

    Requires import contextlib

    Minor suggestion since it's a bit neater than try/except/pass imo

  • I think I marginally prefer the old-school version without an extra import, even if it's slightly more lines.

  • Please register or sign in to reply
  • Robert Rosca
  • Robert Rosca
  • Robert Rosca
  • Robert Rosca
  • Nothing major, my comments are all pretty minor and optional, so LGTM :thumbsup:

  • Robert Rosca approved this merge request

    approved this merge request

  • LGTM, I think it is a good point that we need to add back the logs for the job monitoring along with logs in the serve_overview.

    • Do you think the job monitoring logs would be useful? I don't use the overview page much myself, but I'm conscious that it's pretty slow because it loads all of the information it needs on every request, and adding another file to read will make it slower still.

    • I didn't have a specific log in mind. I think all logs are useful :). I see that there are a few error logs. Connection to Kafka, connection to job DB, and connection to myMDC. I think in principle except for Kafka connection, the webservice would show as well if there are errors while connecting to the job-db and myMDC.

      So maybe for now it is not very useful, but if we get to add more features for the job monitoring it might be useful.

      I know that many in DOC checked the webpage for webservice errors instead of caldeploy logs or opening web.log file.

      Edited by Karim Ahmed
    • Please register or sign in to reply
  • Thomas Kluyver added 1 commit

    added 1 commit

    • 18f1004d - Apply 3 suggestion(s) to 1 file(s)

    Compare with previous version

  • Thomas Kluyver added 1 commit

    added 1 commit

    • 660dbe63 - Add job monitor logs to serve_overview

    Compare with previous version

  • Thomas Kluyver added 1 commit

    added 1 commit

    Compare with previous version

  • I've added the job monitor logs to the overview page; I've currently got this branch deployed on max-exfl017 for testing if you want to look at http://max-exfl017.desy.de:8008/ .

    Here's a screenshot of the overview page with this branch:

    image

    Edited by Thomas Kluyver
  • Thomas Kluyver changed the description

    changed the description

  • Thomas Kluyver mentioned in commit e17cb7d4

    mentioned in commit e17cb7d4

  • Philipp Schmidt mentioned in merge request !543 (closed)

    mentioned in merge request !543 (closed)

  • Thomas Kluyver mentioned in merge request !683 (merged)

    mentioned in merge request !683 (merged)

  • Philipp Schmidt changed milestone to %3.6.0

    changed milestone to %3.6.0

  • Please register or sign in to reply
    Loading