FIX - Recreate folder if sphinx-rep already existed.

Merged Karim Ahmed requested to merge fix/sphix-rep_already_exists into master
3 unresolved threads

Fixes an error that has recently been occurring regularly on production calibration requests, e.g.:

Waiting on jobs to finish: ['8955633']
Convert svg to pdf and png
Prepare timing summary
Traceback (most recent call last):
  File "/home/xcal/deployments/pycalibration-270821/temp/slurm_out_JUNGFRAU_CORRECT_t210919_084629/finalize.py", line 5, in <module>
    finalize(joblist=['8955633'],
  File "/home/xcal/deployments/pycalibration-270821/src/xfel_calibrate/finalize.py", line 430, in finalize
    sphinx_path = combine_report(run_path, calibration)
  File "/home/xcal/deployments/pycalibration-270821/src/xfel_calibrate/finalize.py", line 43, in combine_report
    makedirs(sphinx_path)
  File "/home/xcal/.pyenv/versions/3.8.11/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/home/xcal/deployments/pycalibration-270821/temp/slurm_out_JUNGFRAU_CORRECT_t210919_084629/sphinx_rep'

Description

How Has This Been Tested?

Ran an epix correction, cancelled the finalize job after the sphinx_rep folder had been created, then reran finalize.py.

Relevant Documents (optional)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

Reviewers

@calibration

Edited by Karim Ahmed

Activity

  • Karim Ahmed changed the description

    • I'm sorry if I forgot, did we figure out why this happens in the first place?

    • No, I tried to observe it last Friday and it was happening to random requests.

    • I am not sure if we can inspect if 8955633 was preempted for whatever reason.

      I know that xcal jobs are not preempted as we use these different partitions. But the only reason I can think of is that the job was restarted before finishing building the report.

    • Ah yes, I think we suspected preemption. Well, if sphinx does not mind for existing files, this fix makes sense.

      LGTM

    • I think it's possible for the finalize job (building the report) to be preempted - since some refactoring I did, this runs in the exfel partition where it can use nodes with smaller amounts of RAM. It's a bit surprising if it often preempts short-running (few minutes) jobs, but maybe it does.

      I think if we use --open-mode append, we should be able to see in the slurm-*.out logs when something was preempted and requeued. By default, the log file is overwritten when it starts again.
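
As a rough illustration of the suggestion above, the sketch below builds an sbatch command line that asks Slurm to append to the job's output file rather than truncate it when the job is requeued. The helper name `sbatch_command`, the script name, and the job name are hypothetical and not taken from pycalibration; only the `--open-mode=append` option and the `exfel` partition come from the discussion.

```python
from typing import List


def sbatch_command(script: str, job_name: str, partition: str = "exfel") -> List[str]:
    """Build an sbatch command that appends to the job's log file.

    Hypothetical helper for illustration only. With --open-mode=append,
    output from a requeued job is added to the existing slurm-*.out file
    instead of overwriting it, so earlier output stays visible.
    """
    return [
        "sbatch",
        f"--job-name={job_name}",
        f"--partition={partition}",
        "--open-mode=append",  # keep earlier output if the job is requeued
        script,
    ]


# Example: the command that would be passed to subprocess.run()
cmd = sbatch_command("finalize.sh", "report_finalize")
print(" ".join(cmd))
```

The command is only constructed and printed here; submitting it would require a Slurm installation.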

    • SLURM might also prefer to preempt a short-running task over a longer one, on the idea that it will execute quickly later and/or that little time has been spent on it so far. But all of this is anecdotal speculation :grinning:

    • I've opened !567 (merged) to append to the output file.

    • Well, if sphinx does not mind for existing files, this fix makes sense.

      Uhm, it is not LGTM after all. I had not tested this, on the assumption that Sphinx would not mind existing files. I was wrong. Thanks for that comment.

      Sphinx does indeed mind. I guess I will delete the folder if it already exists.

      Edited by Karim Ahmed
    • Out of curiosity, what exactly was the problem?

  • Karim Ahmed added 1 commit

    • ef5e01d4 - remove if sphinx path existed

  • Karim Ahmed added 1 commit

    • abb59df3 - remove if sphinx path existed

  • MR is updated.

  • Karim Ahmed changed the description

  • Karim Ahmed added 2 commits

    • e7492917 - fix error 'Notebook cannot be run concurrently: no sequences parameter'
    • a777decd - Merge branch 'fix/concurreny_parameter_available' into 'fix/sphix-rep_already_exists'

    • (Re-)LGTM

      It looks like you did not update your master in the meantime and then somehow pulled it in your branch :thinking:

    • Error: specified path is not a directory, or sphinx files already exist.
      sphinx-quickstart only generate into a empty directory. Please specify a new root path.
      Command '['/home/ahmedk/calibration2/.cal2_venv/bin/python', '-m', 'sphinx.cmd.quickstart', '--quiet', "--project='EPIX100 CORRECT Calibration'", "--author='anonymous'", '-v', '3.4.2-1-ga5f0cd4', '--suffix=.rst', '--master=index', '--ext-intersphinx', '--ext-mathjax', '--makefile', '--no-batchfile', PosixPath('/home/ahmedk/calibration2/pycalibration/temp/slurm_out_EPIX100_CORRECT_t210921_103837/sphinx_rep')]' returned non-zero exit status 1.
      Traceback (most recent call last):
        File "/home/ahmedk/calibration2/pycalibration/src/xfel_calibrate/finalize.py", line 225, in make_report
          check_call([sys.executable, "-m", "sphinx.cmd.quickstart",
        File "/gpfs/exfel/sw/calsoft/.pyenv/versions/3.8.11/lib/python3.8/subprocess.py", line 364, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command '['/home/ahmedk/calibration2/.cal2_venv/bin/python', '-m', 'sphinx.cmd.quickstart', '--quiet', "--project='EPIX100 CORRECT Calibration'", "--author='anonymous'", '-v', '3.4.2-1-ga5f0cd4', '--suffix=.rst', '--master=index', '--ext-intersphinx', '--ext-mathjax', '--makefile', '--no-batchfile', PosixPath('/home/ahmedk/calibration2/pycalibration/temp/slurm_out_EPIX100_CORRECT_t210921_103837/sphinx_rep')]' returned non-zero exit status 1.
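
The error above shows why merely tolerating an existing folder is not enough: sphinx-quickstart only generates into an empty directory. A minimal sketch of the "delete the folder if it already exists" approach follows. The helper name `recreate_dir` and the stale `conf.py` file are assumptions made for illustration; this is not the code that was merged.

```python
import shutil
from pathlib import Path
from tempfile import TemporaryDirectory


def recreate_dir(path: Path) -> Path:
    """Return an empty directory at `path`, removing any previous one.

    Hypothetical helper: a directory left behind by an interrupted run is
    deleted with all its contents, then created again.
    """
    if path.is_dir():
        shutil.rmtree(path)
    # Raises FileExistsError if `path` exists but is not a directory.
    path.mkdir(parents=True)
    return path


# Simulate a finalize job that was interrupted after creating sphinx_rep.
with TemporaryDirectory() as tmp:
    sphinx_path = Path(tmp) / "sphinx_rep"
    sphinx_path.mkdir()
    (sphinx_path / "conf.py").write_text("# stale file from an interrupted run\n")

    recreate_dir(sphinx_path)
    leftover = list(sphinx_path.iterdir())  # empty: stale files are gone
    still_a_dir = sphinx_path.is_dir()
```

Deleting the whole folder discards any partial report from the earlier attempt, which seems acceptable here because the report is rebuilt from scratch on each finalize run.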
  • Karim Ahmed added 3 commits

  • Thanks for the review!

  • merged

  • Karim Ahmed mentioned in commit 37f9f381

  • Philipp Schmidt changed milestone to %3.4.3
