[Webservice] Restructure database to give more meaningful success/failure information
Description
At present, the success/failure information we send to myMdC is unreliable in various important ways - see calibration/planning#104. In addition, the `correction_complete` events sent via Kafka are sent per detector, but they don't actually tell you which detector they relate to (they give a detector type, such as JUNGFRAU, but not the Karabo ID of a specific detector). Replacing the webservice still looks like a big enough job that it seemed worth spending a bit of time improving the old one. I hope this might also ease the transition to the new one when it's ready.
The jobs database is completely redesigned, with three tables instead of one: each request from myMdC can have several executions, each of which can have several slurm_jobs. I think these terms line up with what Robert is using in Orca. I haven't tried to do any migration of existing data - we're only using it to track active jobs, so I'm assuming we'll just start with an empty database when deploying this change.
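The requests → executions → slurm_jobs layout described above could look roughly like this. This is only a sketch: the table and column names beyond those mentioned in this MR (`req_id`, `exec_id`, `mymdc_id`, `action`, `proposal`, `run`, `karabo_id`, `det_type`, `success`, `job_id`, `status`) are assumptions, not the actual schema.

```python
import sqlite3

# Hypothetical schema matching the three-table design: one myMdC
# request -> several executions (one per detector) -> several Slurm jobs.
SCHEMA = """
CREATE TABLE IF NOT EXISTS requests (
    req_id   INTEGER PRIMARY KEY,
    mymdc_id INTEGER,   -- ID of the run/dark request in myMdC
    action   TEXT,      -- e.g. 'CORRECT' or 'DARK'
    proposal TEXT,
    run      INTEGER
);
CREATE TABLE IF NOT EXISTS executions (
    exec_id   INTEGER PRIMARY KEY,
    req_id    INTEGER REFERENCES requests (req_id),
    karabo_id TEXT,     -- specific detector, e.g. SPB_IRDA_JF4M
    det_type  TEXT,     -- detector type, e.g. JUNGFRAU
    success   INTEGER   -- NULL while still running
);
CREATE TABLE IF NOT EXISTS slurm_jobs (
    job_id  INTEGER PRIMARY KEY,  -- Slurm job ID
    exec_id INTEGER REFERENCES executions (exec_id),
    status  TEXT        -- Slurm state, e.g. 'R', 'PD', 'NA'
);
"""

def init_job_db(path=":memory:"):
    """Open (and if needed create) the jobs database."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row  # allow row['column'] access
    conn.executescript(SCHEMA)
    return conn
```

Starting with an empty database on deployment, as the MR assumes, means `executescript` with `IF NOT EXISTS` is all the "migration" needed.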
The jobs monitor code is also largely refactored, with the loop broken up into several methods in a `JobsMonitor` class.
How Has This Been Tested?
Deployed on max-exfl017, and submitted correction & dark requests from the test myMdC instance for the CALLAB proposal.
This revealed a number of previously hidden issues which are (hopefully) specific to CALLAB. The presence of JNGFR01 in the filenames triggers an execution for both FXE_XAD_JF1M and SPB_IRDA_JF4M, at least one of which will fail on any given run (as it can't find its data). Previously, the webservice would hide this, because the failing jobs would fail quickly and the status would only come from the last jobs running; with these changes the failures are visible.
Relevant Documents (optional)
https://in.xfel.eu/test_metadata/proposals/259/runs/63612
https://in.xfel.eu/test_metadata/proposals/259/runs/86193
https://in.xfel.eu/test_metadata/proposals/259#proposal-calibration
Types of changes
- Bug fix (non-breaking change which fixes an issue)
- New feature (non-breaking change which adds functionality)
- Refactor (refactoring code with no functionality changes)
Checklist:
- My code follows the code style of this project.
```python
    "UPDATE executions SET success = ? WHERE exec_id = ?",
    (success, exec_id)
)
log.info("Execution finished: %s for (p%s, r%s, %s), success=%s",
         r['action'], r['proposal'], r['run'], r['karabo_id'], success)

if r['action'] == 'CORRECT':
    try:
        self.kafka_prod.send(self.kafka_topic, {
            'event': 'correction_complete',
            'proposal': r['proposal'],
            'run': r['run'],
            'detector': r['det_type'],
            'karabo_id': r['karabo_id'],
            'success': success,
        })
```

The canonical name according to the Karabo naming convention for the first part of a device ID is domain. I wouldn't have hard feelings against "Karabo ID" if not for the unfortunate similarity to device ID.
So I would propose either "karabo-domain" or something Karabo-agnostic like "detector-installation" or just "installation".
FYI, I am aware that so far we have `karabo_id` (notebooks), `detector_identifier` (CALCAT APIs), and domain in the mentioned documentation.
Will we really benefit from adding a new name?
Good point. @schmidtp how does `detector_identifier` sound to you?
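To illustrate what the extra field buys a downstream consumer of these events: with only the detector type, two JUNGFRAU installations in one run are indistinguishable, but the specific-detector field lets you filter. A sketch only - the event dicts mirror the fields in the diff above, the function name is mine, and note a later commit in this MR renames the `karabo_id` field to `detector_identifier`.

```python
def correction_events_for(events, karabo_id):
    """Select correction_complete events for one specific detector.

    'events' is an iterable of decoded Kafka message payloads shaped
    like the dict sent in the diff above. Filtering on the detector
    type alone ('detector': 'JUNGFRAU') could not distinguish e.g.
    FXE_XAD_JF1M from SPB_IRDA_JF4M; 'karabo_id' can.
    """
    return [
        e for e in events
        if e.get('event') == 'correction_complete'
        and e.get('karabo_id') == karabo_id
    ]
```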
```python
r = self.job_db.execute(
    "SELECT * FROM requests WHERE req_id = ?", (req_id,)
).fetchone()
log.info(
    "Jobs finished - action: %s, myMdC id: %s, success: %s",
    r['action'], r['mymdc_id'], success,
)
if r['action'] == 'CORRECT':
    try:
        self.kafka_prod.send(self.kafka_topic, {
            'event': 'run_corrections_complete',
            'proposal': r['proposal'],
            'run': r['run'],
            'success': success,
        })
```

```python
if status in ['R', 'PD']:
    flg = 'R'
elif status == 'NA':
    flg = 'NA'
else:
    flg = 'A'

cflg.append(flg)
cstatus.append(status)
combined[rid] = cflg, cstatus
```

```python
c = conn.execute(
    "SELECT job_id, status FROM slurm_jobs "
    "INNER JOIN executions USING (exec_id) "
    "INNER JOIN requests USING (req_id) "
    "WHERE mymdc_id = ?", (rid,)
)
```
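The flag mapping in the middle hunk can be isolated as a small pure function. The function name is mine; the mapping itself is taken directly from the diff: running/pending Slurm states collapse to 'R', unknown to 'NA', and anything else (finished, one way or another) to 'A'.

```python
def status_flag(status):
    """Map a Slurm job state string to the webservice's one-letter flag."""
    if status in ('R', 'PD'):   # running or pending
        return 'R'
    elif status == 'NA':        # state unknown
        return 'NA'
    else:                       # completed, failed, cancelled, ...
        return 'A'
```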
changed milestone to %3.6.0
```python
 88     def __init__(self, config):
 89         log.info("Starting jobs monitor")
 90         self.job_db = init_job_db(config)
 91         self.mdc = init_md_client(config)
 92         self.kafka_prod = init_kafka_producer(config)
 93         self.kafka_topic = config['kafka']['topic']
 94         self.time_interval = int(config['web-service']['job-update-interval'])
 95
 96     def run(self):
 97         while True:
 98             try:
 99                 self.do_updates()
100             except Exception:
101                 log.error("Failure to update job DB", exc_info=True)
102             time.sleep(self.time_interval)
103
```

- Comment on lines +96 to +103
For supervisord to manage the process in a (slightly) nicer way, it may be worth adding something here to catch SIGTERM/interrupt and explicitly handle it. Maybe adding something like:

```python
def run(self):
    while True:
        try:
            self.do_updates()
        except KeyboardInterrupt:
            self.stop("`KeyboardInterrupt`")
        except Exception:
            log.error("Failure to update job DB", exc_info=True)
        time.sleep(self.time_interval)

def stop(self, *args, **kwargs):
    log.critical(f"Shutting down: {args}, {kwargs}")
    exit(0)
```
With a bit at the end of `main` that does something like:

```python
def main(argv=None):
    ...
    logging.getLogger('kafka').setLevel(logging.INFO)
    jm = JobsMonitor(config)
    signal.signal(signal.SIGTERM, jm.stop)
    signal.signal(signal.SIGINT, jm.stop)
    jm.run()
```
added 1 commit
- 0f67b362 - Rename karabo_id -> detector_installation in Kafka message
Given you restructured the database, we probably need changes to `serve_overview`?

Good point, thanks - I had missed that serve_overview accesses the database directly as well. I'll get on that.
That reminds me: do we also want to store the `python -m xfel-calibrate ...` command line in the database (in the executions table)? It would make the database grow a bit quicker, but we could get rid of some log parsing in serve_overview by reading the database instead.

Does it need to be added to `caldeploy` explicitly?

The database? Or something else? `caldeploy` doesn't currently know anything about the database - it's all just set up by the webservice code when it's launched.

> The database? Or something else? `caldeploy` doesn't currently know anything about the database - it's all just set up by the webservice code when it's launched.

Having it tracked and launched.
> do we also want to store the `python -m xfel-calibrate ...` command line in the database (in the executions table)?

I'm not sure. I always thought of `xfel-calibrate` as stateless. I suppose you suggest having some command line argument to the sqlite database passed by the webservice, but not used interactively?

> Having it tracked and launched.

This doesn't add anything new to launch - there was another MR that already split the job monitor out as a separate process (!668 (merged)).

> I'm not sure. I always thought of `xfel-calibrate` as stateless. I suppose you suggest having some command line argument to the sqlite database passed by the webservice, but not used interactively?

I think we might be talking at cross purposes here. My suggestion was about storing the `xfel-calibrate` command that the webservice launches in the database, not about passing anything extra in that command.

I've briefly tested `serve_overview` with my new commit (8a3f12bc).

> I think we might be talking at cross purposes here. My suggestion was about storing the `xfel-calibrate` command that the webservice launches in the database, not about passing anything extra in that command.

Yes, I think I misread that sentence grammatically. Sure, makes sense. I do not think we should worry about space.

Ultimately I was considering asking whether we want to move to MySQL now already, but it should be its own MR.

OK, I'll do that.

Do you see any particular reason to change DB? I'm probably guilty of being an SQLite fanboy, but it's very convenient not having a separate server to manage or authentication to deal with. And in this context, I quite like that we naturally get a new database file on each deployment, so you don't have to worry about migrations or anything (although I have on occasion copied the file over from one deployment to the next).
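If the command line does get stored per execution as suggested above, serve_overview could replace its log parsing with a simple query. A sketch under stated assumptions: the `command` column on `executions` is hypothetical (it does not exist in this MR), and the function name is mine.

```python
import sqlite3

def recent_commands(conn, limit=10):
    """Return the most recent stored command lines, newest first.

    Assumes a hypothetical 'command' TEXT column on the executions
    table, holding the xfel-calibrate invocation the webservice
    launched for that execution.
    """
    rows = conn.execute(
        "SELECT command FROM executions ORDER BY exec_id DESC LIMIT ?",
        (limit,),
    )
    return [r[0] for r in rows]
```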
added 1 commit
- 8a3f12bc - Use new database structure in serve_overview
- Resolved by Thomas Kluyver
Nice MR @kluyvert. I think we need to include this script as well, as it breaks with these changes: https://git.xfel.eu/detectors/pycalibration/-/blob/master/webservice/sqlite_view.py
added 1 commit
- 55df983d - Ensure run numbers are consistently integers
added 1 commit
- f88fc9d0 - Convert run number back to string for Kafka messages
added 1 commit
- 0b847163 - Rename detector_installation -> detector_identifier in Kafka message
- Resolved by Karim Ahmed