
[Webservice] Fix database lock timeouts

Merged Thomas Kluyver requested to merge fix/websvc-db-write-locking into master
1 unresolved thread

Description

We have been seeing errors in the webservice where it tries to write to the database and times out on a lock. I found 12 instances in the log, with tracebacks pointing to 3 different INSERT statements.

Traceback (most recent call last):
  File "/home/xcal/deployments/development/git.xfel.eu/detectors/pycalibration/pycalibration-2023-03-22-3.9.2-9e88f31/webservice/webservice.py", line 1034, in _continue
    ret, _ = await self.launch_jobs(
  File "/home/xcal/deployments/development/git.xfel.eu/detectors/pycalibration/pycalibration-2023-03-22-3.9.2-9e88f31/webservice/webservice.py", line 1397, in launch_jobs
    cur = self.job_db.execute(
sqlite3.OperationalError: database is locked

This indicates that something running concurrently with the webservice is keeping the database locked (i.e. holding a transaction open) for several seconds. The jobs monitor and the web overview (http://max-exfl016.desy.de:8008/) are the only other things that routinely access this database.

I think I've found the culprit in the job monitor code. When updating job statuses, it was issuing SQL UPDATE statements inside a loop which also runs sacct to get job statuses, and only committing (finishing the transaction) after the loop. If a lot of jobs had finished since the last update, or sacct was slow for some reason, the database stayed locked until the whole loop completed. I've addressed this by collecting the updates in the loop and writing them to the database in one batch afterwards (sketched below; the actual change is in the diff further down).
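
In outline, the change is (a simplified sketch with made-up variable names like ongoing_jobs; the real diff is in the discussion below):

# Before: the implicit transaction opens at the first UPDATE, and the lock
# is then held across every (potentially slow) sacct call until the commit.
for r in ongoing_jobs:
    _, runtime, slstatus = slurm_job_status(r['job_id'])
    c.execute(
        "UPDATE slurm_jobs SET finished=?, elapsed=?, status=? WHERE job_id = ?",
        (True, runtime, slstatus, r['job_id']),
    )
self.job_db.commit()

# After: gather the rows first, then write them all in one short transaction.
updates = []
for r in ongoing_jobs:
    _, runtime, slstatus = slurm_job_status(r['job_id'])
    updates.append((True, runtime, slstatus, r['job_id']))

self.job_db.executemany(
    "UPDATE slurm_jobs SET finished=?, elapsed=?, status=? WHERE job_id = ?",
    updates,
)
self.job_db.commit()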

I also added some timing code wherever we write to the database, so we can see if it's taking longer than expected.
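
The timing wrapper is a context manager along these lines (a sketch of time_db_transaction, not necessarily the exact implementation; the log format matches the messages quoted in the testing comment below):

import logging
import time
from contextlib import contextmanager

@contextmanager
def time_db_transaction(conn, name):
    """Run a block of DB writes, commit, and log how long each part took."""
    t0 = time.perf_counter()
    yield                # the caller's INSERT/UPDATE statements run here
    t1 = time.perf_counter()
    conn.commit()        # finish the transaction, flushing changes to disk
    t2 = time.perf_counter()
    logging.debug(
        "DB change (%s): %.1f ms in transaction, %.1f ms commit",
        name, (t1 - t0) * 1000, (t2 - t1) * 1000,
    )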

How Has This Been Tested?

TBD

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • My code follows the code style of this project.

Reviewers

@ahmedk @schmidtp

Edited by Thomas Kluyver

Activity

     _, runtime, slstatus = slurm_job_status(r['job_id'])
     finished = True

-    c.execute(
-        "UPDATE slurm_jobs SET finished=?, elapsed=?, status=? WHERE job_id = ?",
-        (finished, runtime, slstatus, r['job_id'])
-    )
+    updates.append((finished, runtime, slstatus, r['job_id']))
+
+with time_db_transaction(self.job_db, 'Update jobs'):
+    self.job_db.executemany(
+        "UPDATE slurm_jobs SET finished=?, elapsed=?, status=? WHERE job_id = ?",
+        updates
+    )
  • Nice find and fix, LGTM.

    Just for my understanding, the exception appears when a timeout is reached while waiting for the lock?

  • Yup. Read operations need a shared lock, and write operations need an exclusive lock. SQLite will wait up to 5 seconds for the lock (the timeout is a parameter set when you open the database), and if it takes longer than that, you get this OperationalError.

    So we should aim to make database operations, especially writes, as 'point-like' as possible. In this example, we could also commit after each update command, to make N very small transactions. This has some upsides, but it's inefficient, because each commit waits until the data is fully flushed to disk.
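
    To illustrate the timeout (a sketch; the filename here is made up, but the timeout parameter is the standard sqlite3 one):

    import sqlite3

    # The busy timeout is set when the connection is opened; sqlite3's
    # default is 5 seconds. If the lock can't be acquired within that
    # window, the statement raises
    # "sqlite3.OperationalError: database is locked".
    conn = sqlite3.connect("jobs.sqlite", timeout=5)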

  • Thomas Kluyver added 1 commit

  • Thank you @kluyvert, that is a smart way to fix it. I would also suggest adding a docstring to the new class.

    Besides that, LGTM

  • Thomas Kluyver added 1 commit

    • 59494930 - Add docstring for time_db_transaction

  • Tested on max-exfl017; I see log messages like these:

    # Webservice
    2023-04-17 15:37:49,774 - root - DEBUG - [webservice.py:330] DB change (Insert execution): 1.5 ms in transaction, 4.1 ms commit
    2023-04-17 15:46:58,817 - root - DEBUG - [webservice.py:330] DB change (Insert jobs): 1.0 ms in transaction, 5.3 ms commit
    
    # Job monitor
    2023-04-17 15:55:13,280 - root - DEBUG - [webservice.py:330] DB change (Update jobs): 0.8 ms in transaction, 3.6 ms commit
    2023-04-17 15:55:13,284 - root - DEBUG - [webservice.py:330] DB change (Update execution): 0.3 ms in transaction, 3.8 ms commit

    Calling .commit() to finish the transaction seems to take longer than executing the SQL to insert/update rows. I think this makes sense: commit has to flush the changes to disk, which takes time.

    If the 'in transaction' time were larger, that might indicate we're waiting for the lock to modify the database. If the 'commit' time were much longer, that might mean GPFS hiccuped as we were trying to write.

  • merged

  • Thomas Kluyver mentioned in commit 9d3bde27

  • Philipp Schmidt changed milestone to %3.10.0
