[Webservice] Fix database lock timeouts
Description
We have been seeing errors in the webservice where it tries to write to the database and times out on a lock. I found 12 instances in the log, with tracebacks pointing to 3 different INSERT statements.
Traceback (most recent call last):
File "/home/xcal/deployments/development/git.xfel.eu/detectors/pycalibration/pycalibration-2023-03-22-3.9.2-9e88f31/webservice/webservice.py", line 1034, in _continue
ret, _ = await self.launch_jobs(
File "/home/xcal/deployments/development/git.xfel.eu/detectors/pycalibration/pycalibration-2023-03-22-3.9.2-9e88f31/webservice/webservice.py", line 1397, in launch_jobs
cur = self.job_db.execute(
sqlite3.OperationalError: database is locked
This indicates that something running concurrently with the webservice is keeping the database locked (i.e. holding a transaction open) for several seconds. The jobs monitor and the web overview (http://max-exfl016.desy.de:8008/) are the only other things that routinely access this database.
I think I've found the culprit in the job monitor code. When updating job statuses, it was issuing SQL updates in a loop which also runs sacct
to get job statuses, and then committing (finishing the transaction) outside the loop. If a lot of jobs had finished since the last update, or sacct
was slow for some reason, this would mean the database was locked until it finished. I've addressed this by batching the updates needed and writing to the database outside the loop.
I also added some timing code wherever we write to the database, so we can see if it's taking longer than expected.
How Has This Been Tested?
TBD
Types of changes
- Bug fix (non-breaking change which fixes an issue)
Checklist:
- My code follows the code style of this project.