[Webservice] Restructure database to give more meaningful success/failure information

Merged Thomas Kluyver requested to merge webservice-refactor-db into master

Description

At present, the success/failure information we send to myMdC is unreliable in various important ways - see calibration/planning#104. In addition, the correction_complete events sent via Kafka are sent per detector, but they don't actually tell you which detector they relate to (they give a detector type, such as JUNGFRAU, but not the Karabo ID of a specific detector). Replacing the webservice entirely still looks like a big enough job that it seemed worth spending some time improving the old one. I hope this will also ease the transition to the new one when it's ready.

The jobs database is completely redesigned, with three tables instead of one: each request from myMdC can have several executions, each of which can have several slurm_jobs. I think these terms line up with what Robert is using in Orca. I haven't tried to do any migration of existing data - we're only using it to track active jobs, so I'm assuming we'll just start with an empty database when deploying this change.
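
For reference, a minimal sketch of what the three-table layout could look like. This is an illustration, not the exact schema from this MR: column names beyond those visible in the diff snippets below (req_id, mymdc_id, exec_id, job_id, action, proposal, run, karabo_id, det_type, success, status) are assumptions.

    import sqlite3

    # Hypothetical sketch: one myMdC request fans out into executions
    # (one per detector), each of which owns several Slurm jobs.
    SCHEMA = """
    CREATE TABLE IF NOT EXISTS requests (
        req_id INTEGER PRIMARY KEY,
        mymdc_id INTEGER,   -- ID of the request in myMdC
        action TEXT,        -- e.g. 'CORRECT' or 'DARK'
        proposal TEXT,
        run INTEGER
    );
    CREATE TABLE IF NOT EXISTS executions (
        exec_id INTEGER PRIMARY KEY,
        req_id INTEGER REFERENCES requests (req_id),
        karabo_id TEXT,     -- a specific detector, e.g. 'SPB_IRDA_JF4M'
        det_type TEXT,      -- detector type, e.g. 'JUNGFRAU'
        success INTEGER     -- NULL while running, then 0 or 1
    );
    CREATE TABLE IF NOT EXISTS slurm_jobs (
        job_id INTEGER PRIMARY KEY,  -- the Slurm job ID
        exec_id INTEGER REFERENCES executions (exec_id),
        status TEXT         -- Slurm state, e.g. 'R', 'PD', 'NA'
    );
    """

    conn = sqlite3.connect("webservice_jobs.sqlite")
    conn.executescript(SCHEMA)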

The jobs monitor code is also largely refactored, with the loop broken up into several methods in a JobsMonitor class.

How Has This Been Tested?

Deployed on max-exfl017, and submitted correction & dark requests from the test myMdC instance for the CALLAB proposal.

This revealed a number of previously hidden issues which are (hopefully) specific to CALLAB. The presence of JNGFR01 in the filenames triggers an execution for both FXE_XAD_JF1M and SPB_IRDA_JF4M, at least one of which will fail on any given run (as it can't find its data). Previously, the webservice would hide this: the failing jobs failed quickly, so the reported status came only from the last jobs still running. With these changes, the failures are visible.

Relevant Documents (optional)

https://in.xfel.eu/test_metadata/proposals/259/runs/63612

[screenshot]

https://in.xfel.eu/test_metadata/proposals/259/runs/86193

[screenshot]

https://in.xfel.eu/test_metadata/proposals/259#proposal-calibration

[screenshot]

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor (refactoring code with no functionality changes)

Checklist:

  • My code follows the code style of this project.

Reviewers

@ahmedk @roscar @schmidtp


Activity

210 "UPDATE executions SET success = ? WHERE exec_id = ?",
211 (success, exec_id)
212 )
213 log.info("Execution finished: %s for (p%s, r%s, %s), success=%s",
214 r['action'], r['proposal'], r['run'], r['karabo_id'], success)
215
216 if r['action'] == 'CORRECT':
217 try:
218 self.kafka_prod.send(self.kafka_topic, {
219 'event': 'correction_complete',
220 'proposal': r['proposal'],
221 'run': r['run'],
222 'detector': r['det_type'],
223 'karabo_id': r['karabo_id'],
224 'success': success,
225 })
  • Comment on lines +218 to +225

    This is the message we're already sending (per execution). I've added the karabo_id field. I think 'Karabo ID' might just be what we call this internally for calibration, so maybe we should find a better external name for it.

  • According to the Karabo naming convention, the canonical name for the first part of a device ID is domain. I wouldn't have hard feelings about Karabo ID if not for the unfortunate similarity to device ID.

    So I would propose either "karabo-domain" or something Karabo-agnostic like "detector-installation" or just "installation".

  • OK, I'll go with detector_installation for now; it's long but I think it's quite clear.

  • FYI, I am aware that so far we have karabo_id (notebooks), detector_identifier (CalCat APIs), and domain in the mentioned documentation.

    Will we really benefit from adding a new name?

  • Good point. @schmidtp how does detector_identifier sound to you?

  • Sure, fine by me.

    In all fairness CalCat also has detector_name and uses karabo_id itself in some APIs :face_palm: It's a real mess.
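
    (For reference: after the renames in the commits below, karabo_id becomes detector_identifier, so the per-execution payload ends up looking roughly like this. The values are illustrative, not real data.)

        # Illustrative payload shape after the final rename (commit 0b847163).
        # Values are made up; run numbers are sent as strings (commit f88fc9d0).
        {
            'event': 'correction_complete',
            'proposal': '700000',                    # placeholder
            'run': '123',                            # placeholder
            'detector': 'JUNGFRAU',                  # detector type
            'detector_identifier': 'SPB_IRDA_JF4M',  # formerly karabo_id
            'success': True,
        }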

  • 239
    240         r = self.job_db.execute(
    241             "SELECT * FROM requests WHERE req_id = ?", (req_id,)
    242         ).fetchone()
    243         log.info(
    244             "Jobs finished - action: %s, myMdC id: %s, success: %s",
    245             r['action'], r['mymdc_id'], success,
    246         )
    247         if r['action'] == 'CORRECT':
    248             try:
    249                 self.kafka_prod.send(self.kafka_topic, {
    250                     'event': 'run_corrections_complete',
    251                     'proposal': r['proposal'],
    252                     'run': r['run'],
    253                     'success': success,
    254                 })
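
    (A sketch of how a downstream service could consume these events with kafka-python; the broker address and topic name here are placeholders, not the production configuration.)

        import json
        from kafka import KafkaConsumer

        # Hypothetical consumer; broker and topic are placeholders.
        consumer = KafkaConsumer(
            'correction-events',
            bootstrap_servers='broker.example:9092',
            value_deserializer=lambda m: json.loads(m.decode('utf-8')),
        )

        for msg in consumer:
            event = msg.value
            if event['event'] == 'run_corrections_complete':
                # One of these per request, once all detectors are done.
                print(event['proposal'], event['run'], event['success'])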
  • 241         if status in ['R', 'PD']:
    242             flg = 'R'
    243         elif status == 'NA':
    244             flg = 'NA'
    245         else:
    246             flg = 'A'
    247
    248         cflg.append(flg)
    249         cstatus.append(status)
    250     combined[rid] = cflg, cstatus
    256     c = conn.execute(
    257         "SELECT job_id, status FROM slurm_jobs "
    258         "INNER JOIN executions USING (exec_id) "
    259         "INNER JOIN requests USING (req_id) "
    260         "WHERE mymdc_id = ?", (rid,)
    261     )
    • Comment on lines +256 to +261

      If query-rid is actually used often, we should probably add an index on mymdc_id in the database to make this lookup more efficient.
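
      A sketch of such an index, matching the column names in the query above (where exactly it would be created is up to the DB setup code):

          import sqlite3

          def add_mymdc_index(conn: sqlite3.Connection):
              # Speeds up lookups by myMdC request ID; mymdc_id lives on
              # the requests table in the join above.
              conn.execute(
                  "CREATE INDEX IF NOT EXISTS ix_requests_mymdc_id "
                  "ON requests (mymdc_id)"
              )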

  • Thomas Kluyver changed milestone to %3.6.0

  • 88     def __init__(self, config):
    89         log.info("Starting jobs monitor")
    90         self.job_db = init_job_db(config)
    91         self.mdc = init_md_client(config)
    92         self.kafka_prod = init_kafka_producer(config)
    93         self.kafka_topic = config['kafka']['topic']
    94         self.time_interval = int(config['web-service']['job-update-interval'])
    95
    96     def run(self):
    97         while True:
    98             try:
    99                 self.do_updates()
    100            except Exception:
    101                log.error("Failure to update job DB", exc_info=True)
    102            time.sleep(self.time_interval)
    103
    • Comment on lines +96 to +103

      For supervisord to manage the process in a (slightly) nicer way, it may be worth adding something here to catch SIGTERM/interrupt and explicitly handle it. Maybe adding something like:

          def run(self):
              while True:
                  try:
                      self.do_updates()
                  except KeyboardInterrupt:
                      self.stop("`KeyboardInterrupt`")
                  except Exception:
                      log.error("Failure to update job DB", exc_info=True)
                  time.sleep(self.time_interval)
      
          def stop(self, *args, **kwargs):
              log.critical(f"Shutting down: {args}, {kwargs}")
              exit(0)

      With a bit at the end of main that does something like:

          def main(argv=None):
              ...
              logging.getLogger('kafka').setLevel(logging.INFO)
              jm = JobsMonitor(config)

              signal.signal(signal.SIGTERM, jm.stop)
              signal.signal(signal.SIGINT, jm.stop)

              jm.run()
    • Good point, see what you think of the implementation I've just committed.
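
      (The committed implementation isn't shown in this thread; a plausible sketch, assuming handlers that raise SystemExit so run() unwinds cleanly, would be:)

          import logging
          import signal
          import sys

          log = logging.getLogger(__name__)

          # Assumption: handlers for SIGTERM/SIGINT raise SystemExit, so the
          # monitor loop exits cleanly and supervisord sees a normal shutdown.
          def install_signal_handlers():
              def _shutdown(signum, frame):
                  log.critical("Shutting down on signal %s", signum)
                  sys.exit(0)

              signal.signal(signal.SIGTERM, _shutdown)
              signal.signal(signal.SIGINT, _shutdown)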

  • Thomas Kluyver added 1 commit


  • Thomas Kluyver added 1 commit

    • 0f67b362 - Rename karabo_id -> detector_installation in Kafka message

  • Looks good, thanks!

    Given you restructured the database, we probably need changes to serve_overview?

    Does it need to be added to caldeploy explicitly?

    • Given you restructured the database, we probably need changes to serve_overview?

      Good point, thanks - I had missed that serve_overview accesses the database directly as well. I'll get on that.

      That reminds me: do we also want to store the python -m xfel-calibrate ... command line in the database (in the executions table)? It would make the database grow a bit quicker, but we could get rid of some log parsing in serve_overview by reading the database instead.

      Does it need to be added to caldeploy explicitly?

      The database? Or something else? caldeploy doesn't currently know anything about the database - it's all just set up by the webservice code when it's launched.

    • The database? Or something else? caldeploy doesn't currently know anything about the database - it's all just set up by the webservice code when it's launched.

      Having it tracked and launched.

      do we also want to store the python -m xfel-calibrate ... command line in the database (in the executions table)?

      I'm not sure. I always thought of xfel-calibrate as stateless. I suppose you suggest having some command line argument to the sqlite database passed by the webservice, but not used interactively?

    • Having it tracked and launched.

      This doesn't add anything new to launch - there was another MR that already split the job monitor out as a separate process (!668 (merged)).

      I'm not sure. I always thought of xfel-calibrate as stateless. I suppose you suggest having some command line argument to the sqlite database passed by the webservice, but not used interactively?

      I think we might be talking at cross purposes here. My suggestion was about storing the xfel-calibrate command that the webservice launches in the database, not about passing anything extra in that command.

      I've briefly tested serve_overview with my new commit (8a3f12bc).

      [screenshot]

    • I think we might be talking at cross purposes here. My suggestion was about storing the xfel-calibrate command that the webservice launches in the database, not about passing anything extra in that command.

      Yes, I think I misread that sentence grammatically. Sure, makes sense. I do not think we should worry about space.

      Ultimately I was considering asking whether we want to move to MySQL already, but that should be its own MR.

    • OK, I'll do that. :thumbsup:

      Do you see any particular reason to change DB? I'm probably guilty of being an SQLite fanboy, but it's very convenient not having a separate server to manage or authentication to deal with. And in this context, I quite like that we naturally get a new database file on each deployment, so you don't have to worry about migrations or anything (although I have on occasion copied the file over from one deployment to the next).

    • It's mainly about integration with Grafana. Right now we essentially have yet another service that translates into MySQL...

    • OK, I've added the command to the database, and updated the serve_overview code to use that. I've just checked on max-exfl017 that it's reading jobs back correctly from the database:

      [screenshot]
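
      (A sketch of storing and reading the command, assuming a command column on the executions table; the real column name isn't visible in this thread:)

          # Assumption: `conn` is the sqlite3 connection, `cmd` the
          # xfel-calibrate argument list; `command` is a guessed column name.
          def store_command(conn, exec_id, cmd):
              conn.execute(
                  "UPDATE executions SET command = ? WHERE exec_id = ?",
                  (' '.join(cmd), exec_id),
              )

          def commands_for_request(conn, rid):
              # What serve_overview could read instead of scraping logs.
              return conn.execute(
                  "SELECT command FROM executions "
                  "INNER JOIN requests USING (req_id) "
                  "WHERE mymdc_id = ?", (rid,),
              ).fetchall()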

  • Thomas Kluyver added 1 commit

    • 8a3f12bc - Use new database structure in serve_overview

  • Thomas Kluyver added 1 commit

    • 85959cde - Update sqlite_view.py script

  • Thomas Kluyver added 1 commit

    • 55df983d - Ensure run numbers are consistently integers

  • Thomas Kluyver added 2 commits

    • 4f068c21 - Parse run numbers as integers in sqlite_view.py
    • fc6b814f - Clean up sqlite_view script a bit

  • Thomas Kluyver added 1 commit

    • f88fc9d0 - Convert run number back to string for Kafka messages

  • Thomas Kluyver added 1 commit

    • 0b847163 - Rename detector_installation -> detector_identifier in Kafka message

  • Thomas Kluyver added 2 commits

    • 452133c8 - Store xfel-calibrate command in jobs database
    • dacb2cfc - Use commands from database in serve_overview instead of log scraping
