[Webservice] Fix database lock timeouts
Description
We have been seeing errors in the webservice where it tries to write to the database and times out on a lock. I found 12 instances in the log, with tracebacks pointing to 3 different INSERT statements.
Traceback (most recent call last):
  File "/home/xcal/deployments/development/git.xfel.eu/detectors/pycalibration/pycalibration-2023-03-22-3.9.2-9e88f31/webservice/webservice.py", line 1034, in _continue
    ret, _ = await self.launch_jobs(
  File "/home/xcal/deployments/development/git.xfel.eu/detectors/pycalibration/pycalibration-2023-03-22-3.9.2-9e88f31/webservice/webservice.py", line 1397, in launch_jobs
    cur = self.job_db.execute(
sqlite3.OperationalError: database is locked
This indicates that something running concurrently with the webservice is keeping the database locked (i.e. holding a transaction open) for several seconds. The jobs monitor and the web overview (http://max-exfl016.desy.de:8008/) are the only other things that routinely access this database.
I think I've found the culprit in the job monitor code. When updating job statuses, it was issuing SQL updates in a loop which also runs `sacct` to get job statuses, and then committing (finishing the transaction) outside the loop. If a lot of jobs had finished since the last update, or `sacct` was slow for some reason, this would mean the database was locked until the whole loop finished. I've addressed this by batching up the updates and writing them to the database outside the loop.
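To make the shape of the fix concrete, here is a minimal sketch of the batching pattern, with simplified names; `jobs_to_check`, the fixed `finished=True`, and the function signature are assumptions for illustration, not the actual job monitor code:

```python
import sqlite3

def update_job_statuses(job_db: sqlite3.Connection, jobs_to_check, slurm_job_status):
    """Sketch: query sacct per job inside the loop, write to the DB once at the end."""
    updates = []
    for r in jobs_to_check:
        # Potentially slow: this calls out to sacct for each job.
        _, runtime, slstatus = slurm_job_status(r['job_id'])
        updates.append((True, runtime, slstatus, r['job_id']))

    # One short write transaction, instead of holding the lock while sacct runs.
    job_db.executemany(
        "UPDATE slurm_jobs SET finished=?, elapsed=?, status=? WHERE job_id = ?",
        updates,
    )
    job_db.commit()
```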
I also added some timing code wherever we write to the database, so we can see if it's taking longer than expected.
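The timing helper itself isn't shown in this excerpt; below is a hedged sketch of what a context manager like `time_db_transaction` could look like, reverse-engineered from the log format that appears later in this thread (the real helper, which the reviewer refers to as a class, may be implemented differently):

```python
import logging
import sqlite3
import time
from contextlib import contextmanager

log = logging.getLogger(__name__)

@contextmanager
def time_db_transaction(conn: sqlite3.Connection, label: str):
    """Run a write transaction, commit it, and log both durations (sketch)."""
    t0 = time.perf_counter()
    try:
        yield
    except Exception:
        conn.rollback()
        raise
    t1 = time.perf_counter()
    conn.commit()
    t2 = time.perf_counter()
    log.debug(
        "DB change (%s): %.1f ms in transaction, %.1f ms commit",
        label, (t1 - t0) * 1000, (t2 - t1) * 1000,
    )
```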
How Has This Been Tested?
TBD
Types of changes
- Bug fix (non-breaking change which fixes an issue)
Checklist:
- My code follows the code style of this project.
Activity
      _, runtime, slstatus = slurm_job_status(r['job_id'])
      finished = True

-     c.execute(
+     updates.append((finished, runtime, slstatus, r['job_id']))
+
+ with time_db_transaction(self.job_db, 'Update jobs'):
+     self.job_db.executemany(
          "UPDATE slurm_jobs SET finished=?, elapsed=?, status=? WHERE job_id = ?",
-         (finished, runtime, slstatus, r['job_id'])
+         updates
      )

Yup. Read operations need to get a shared lock, and write operations need an exclusive lock. It will wait for 5 seconds (this is a parameter when you open the database) for the lock, and if it takes longer than that, you get this OperationalError.
So we should aim to make database operations, especially writes, as 'point-like' as possible. In this example, we could also commit after each UPDATE command, making N very small transactions. That has some upsides, but it's inefficient, because each commit waits until the data is fully flushed to disk.
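For reference, the 5-second wait mentioned above is sqlite3's default busy timeout, set via the `timeout` argument when the connection is opened; the sketch below shows that parameter and the commit-per-row alternative discussed here (the file name and table schema are made up for the example):

```python
import sqlite3

# The lock wait is the timeout= argument to connect(); 5 seconds is the default.
conn = sqlite3.connect("example_jobs.sqlite", timeout=5.0)
conn.execute(
    "CREATE TABLE IF NOT EXISTS slurm_jobs (job_id INTEGER, finished, elapsed, status)"
)

# Commit-per-row alternative: each transaction is tiny, so the lock is never
# held for long, but every commit waits for the data to reach disk, making N
# small commits much slower than one batched transaction.
updates = [(True, "0:01:23", "COMPLETED", 12345)]  # dummy data
for params in updates:
    conn.execute(
        "UPDATE slurm_jobs SET finished=?, elapsed=?, status=? WHERE job_id = ?",
        params,
    )
    conn.commit()
```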
- Resolved by Thomas Kluyver
Thank you @kluyvert, that is a smart way to fix it. I would also suggest adding a docstring to the new class.
Besides that, LGTM
Tested on max-exfl017; I see log messages like these:
# Webservice
2023-04-17 15:37:49,774 - root - DEBUG - [webservice.py:330] DB change (Insert execution): 1.5 ms in transaction, 4.1 ms commit
2023-04-17 15:46:58,817 - root - DEBUG - [webservice.py:330] DB change (Insert jobs): 1.0 ms in transaction, 5.3 ms commit

# Job monitor
2023-04-17 15:55:13,280 - root - DEBUG - [webservice.py:330] DB change (Update jobs): 0.8 ms in transaction, 3.6 ms commit
2023-04-17 15:55:13,284 - root - DEBUG - [webservice.py:330] DB change (Update execution): 0.3 ms in transaction, 3.8 ms commit
Calling `.commit()` to finish the transaction seems to take longer than executing the SQL to insert/update rows. I think this makes sense - commit should flush the changes to disk, which will take time. If the 'in transaction' time was larger, that might indicate that it's waiting for the lock to modify the database. If the 'commit' time was much longer, that might mean GPFS hiccuped as we were trying to write.
mentioned in commit 9d3bde27
changed milestone to %3.10.0