@@ -103,7 +103,8 @@ constants files)
PDUs can be transferred between detectors and instruments or replaced. To correct raw data, the calibration pipeline produces calibration constants, which are specific to the module they were generated from. As a result, it's crucial to map the hardware to the corresponding software modules.
It's possible to update a detector mapping through CalCat when the PDUs are already available, e.g. moving a PDU from one detector to another. Otherwise, only users with admin access can add entirely new PDUs.
![detector_pdu_mapping_calcat](../static/calcat/detector_mapping_edit.png)
### Modifying detector mapping
## Detector Specific troubleshooting
## AGIPD
### No corrected data even though there are available trains.
It has been reported more than once by the instruments and the DOC that the offline correction did not produce any data. After the DOC checked the reports, there was a warning that no trains were available to correct for all modules, and therefore no processed data was available.
This can happen when the offline calibration is configured to correct illuminated data based on the LitFrameFinder device. This device tells the calibration pipeline which frames are illuminated and should be processed. If the data contains no lit frames, no correction takes place and no processed files are produced.
![No LitFrames](../static/troubleshooting/no_litframes.png)
One of the reasons for this can be that there was no beam at the time the data was acquired. This is something the instrument scientists should be aware of, and it is usually possible to check the shutter configuration switches for this particular run. As the shutter switch names can change, they are not included here to avoid giving outdated dataset key names.
Below is a table showing the shutter switch states for a run acquired with no beam.
![No beam](../static/troubleshooting/shutter_run_state.png)
<!-- ## DSSC
## Epix100 -->
## Gotthard2
### GH2-25um: Correction fails with a `No trains to correct` error in the PDF report.
![GH2 25um No trains to correct](../static/troubleshooting/gh2_25um_no_trains_to_correct.png)
Check the corresponding sequence file, similar to what is described [in the general troubleshooting section](./troubleshooting.md#processing-failed-and-report-mentions-no-trains-to-process).
For 25um it is expected to find data for two modules (two separate receivers). Below is an example
of a run with missing trains for one of the two modules. This is the output of `h5glance`:
```bash
module load exfel exfel-python
h5glance /gpfs/exfel/exp/SPB/202401/p005476/raw/r0020/RAW-R0020-GH200-S00000.h5 INSTRUMENT/SA1_XTD9_HIREX/DET
```
```bash
/gpfs/exfel/exp/SPB/202401/p005476/raw/r0020/RAW-R0020-GH200-S00000.h5/INSTRUMENT/SA1_XTD9_HIREX/DET
├GOTTHARD2_MASTER:daqOutput
│ └data (3 attributes)
│ ├adc [uint16: 100 × 2720 × 1280] (4 attributes)
│ ├bunchId [uint64: 100 × 2720] (17 attributes)
│ ├frameNumber [uint64: 100 × 2720] (17 attributes)
│ ├gain [uint8: 100 × 2720 × 1280] (4 attributes)
│ ├memoryCell [uint8: 100 × 2720] (17 attributes)
│ ├timestamp [float64: 100 × 2720] (17 attributes)
│ └trainId [uint64: 100] (1 attributes)
└GOTTHARD2_SLAVE:daqOutput
└data (3 attributes)
├adc [uint16: 0 × 2720 × 1280] (4 attributes)
├bunchId [uint64: 0 × 2720] (17 attributes)
├frameNumber [uint64: 0 × 2720] (17 attributes)
├gain [uint8: 0 × 2720 × 1280] (4 attributes)
├memoryCell [uint8: 0 × 2720] (17 attributes)
├timestamp [float64: 0 × 2720] (17 attributes)
└trainId [uint64: 0] (1 attributes)
```
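As a quick cross-check of which module is missing data, `h5glance` can also be pointed directly at the `trainId` dataset of each receiver (a sketch reusing the file and source names from the example above; a reported length of 0 means no trains for that module):
```bash
module load exfel exfel-python
h5glance /gpfs/exfel/exp/SPB/202401/p005476/raw/r0020/RAW-R0020-GH200-S00000.h5 INSTRUMENT/SA1_XTD9_HIREX/DET/GOTTHARD2_MASTER:daqOutput/data/trainId
h5glance /gpfs/exfel/exp/SPB/202401/p005476/raw/r0020/RAW-R0020-GH200-S00000.h5 INSTRUMENT/SA1_XTD9_HIREX/DET/GOTTHARD2_SLAVE:daqOutput/data/trainId
```
For the run above this reports 100 trains for `GOTTHARD2_MASTER` and 0 for `GOTTHARD2_SLAVE`, matching the `No trains to correct` failure.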
## Jungfrau
### No constants retrieved for burst mode data
Burst mode operates solely at fixed gain. This is realized either by fixed medium gain, or by adaptive gain under the assumption of low illumination, which is equivalent to fixed high gain with an additional safety margin. Therefore, we decided to incorporate only fixed gain constants for burst mode.
The middleware tasked with executing dark runs has been updated to reflect this operational adjustment. Nonetheless, there have been occurrences where dark runs were manually acquired at the instruments, creating a potential for error by injecting burst mode constants in adaptive mode. This misstep could result either in the failure to retrieve constants, as the anticipated fixed gain constants were not injected, or in the retrieval of outdated constants, leading to incorrect corrections.
<!-- ## LPD
## PnCCD -->
@@ -41,17 +41,20 @@ The main important columns in the `Runs` page are:
??? Warning "Dark runs are skipped."
    Dark runs are not calibrated by default. If a manual request was made, the calibration status will be in an error state.
![!Run correction status](../static/myMDC/run_72_general_status.png){align=right width=240}
To check more details about a specific run, such as its ID, size, number of files, and first and last train IDs, clicking on the run number will take you to the run page.
This page is very useful for checking the correction status, as both the `Calibrated Process data status` and `Calibration Pipeline auto Reply` keys are updated regularly by the calibration service until a correction is finished.
For example, if a run correction request failed, `Calibration Pipeline auto Reply` can contain very useful information on the issue and ideally how to solve it.
Additionally, the `Processing Reports` tab has all generated PDF reports for every calibration attempt.
![!Processing Reports](../static/myMDC/processing_reports.png)
---------------------------------------------------
# Troubleshooting
## Calibration (correct or dark) request failed:
1. Check if there is an available PDF report.
2. If a report exists, open it and check what kind of error it shows.
3. If there is no available report, the next step is to check the logs of the [calibration webservice](max-exfl-cal001.desy.de:8008). The [webservice](operation/webservice.md) can be accessed with access to the Maxwell cluster.
![webservice logs](../static/webservice/webservice_log.png)
![jobmonitor service logs](../static/webservice/job_monitor_log.png)
## Processing failed and report mentions no trains to process.
1. Validate the raw data by checking the number of available trains.
2. `h5glance` can be a useful tool for this purpose.
```bash
module load exfel exfel-python
h5glance <A-RAW-H5-file>.h5
```
- Check if the output shows a shape of (0, ...) for the datasets in the detector's INSTRUMENT data group; see also the sketch below.
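Another option (a sketch; `lsxfel` ships with EXtra-data, and the run path below is the usual placeholder) is to summarise the run and check the number of trains it reports:
```bash
module load exfel exfel-python
# Summarise the run directory and check the reported number of trains.
lsxfel /gpfs/exfel/exp/{instrument}/{cycle}/{proposal}/raw/r{run-number}
```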
## Processing failed and report shows an error
1. Validate the raw data and check that it has no unexpected datasets or train mismatches.
2. `extra-data-validate` can be a useful tool for doing this.
```bash
module load exfel exfel-python
extra-data-validate /gpfs/exfel/exp/{instrument}/{cycle}/{proposal}/raw/r{run-number}
```
3. In case the data was not validated, report the issue to the DOC and mention the reason for the failed validation.
4. In case the data was validated, report the issue to CAL to investigate.
## Slow calibration processing
It can happen that an instrument reports unusual slowness in processing. It is essential to differentiate between slowness in processing the data after the request was directly triggered, and the instrument receiving a report too late after myMDC should have triggered a calibration.
This is important because there can be different issues that will need different groups to follow up on.
### Data migration takes too long
- The calibration webservice shouldn't start any calibration until the data has been migrated from the `ONC` to `/gpfs/`.
In multiple instances there were migration issues, either because of a pileup caused by how often small runs are acquired and migrated in a proposal, or because of a specific issue that ITDM needs to investigate.
To confirm that the slowness in calibration is related to slow migration, one can check the [calibration webservice log](#calibration-correct-or-dark-request-failed) through the [webservice overview webpage](webservice.md#webservice-overview), or, with access to `xcal@max-exfl-cal001.desy.de`, check the log files of the running deployed pycalibration instance. Below is an example of logs showing the webservice repeatedly checking whether data was migrated for run 22 before starting the offline correction.
```bash
2023-06-07 12:16:00,161 - root - INFO - [webservice.py:351] python -m xfel_calibrate.calibrate agipd CORRECT --slurm-scheduling 1568 --slurm-partition upex-middle --slurm-mem 700 --request-time 2023-06-07T12:13:38 --slurm-name correct_MID_agipd_202301_p003493_r22 --report-to /gpfs/exfel/exp/MID/202301/p003493/usr/Reports/r22/MID_DET_AGIPD1M-1_correct_003493_r22_230607_121600 --cal-db-timeout 300000 --cal-db-interface tcp://max-exfl016:8015#8044 --ctrl-source-template {}/MDL/FPGA_COMP --karabo-da AGIPD00 AGIPD01 AGIPD02 AGIPD03 AGIPD04 AGIPD05 AGIPD06 AGIPD07 AGIPD08 AGIPD09 AGIPD10 AGIPD11 AGIPD12 AGIPD13 AGIPD14 AGIPD15 --karabo-id-control MID_EXP_AGIPD1M1 --receiver-template {}CH0 --compress-fields gain mask data --recast-image-data int16 --round-photons --use-litframe-finder auto --use-super-selection final --use-xgm-device SA2_XTD1_XGM/XGM/DOOCS --adjust-mg-baseline --bias-voltage 300 --blc-set-min --blc-stripes --cm-dark-fraction 0.15 --cm-dark-range -30 30 --cm-n-itr 4 --common-mode --ff-gain 1.0 --force-hg-if-below --force-mg-if-below --hg-hard-threshold 1000 --low-medium-gap --mg-hard-threshold 1000 --overwrite --rel-gain --sequences-per-node 1 --slopes-ff-from-files --xray-gain --max-tasks-per-worker 1 --in-folder /gpfs/exfel/exp/MID/202301/p003493/raw --out-folder /gpfs/exfel/d/proc/MID/202301/p003493/r0022 --karabo-id MID_DET_AGIPD1M-1 --run 22
2023-06-07 12:15:59,918 - root - INFO - [webservice.py:517] Transfer complete: proposal 003493, runs ['22']
2023-06-07 12:15:49,810 - root - INFO - [webservice.py:471] Proposal 003493 run 22 not migrated yet. Will try again (13/300)
2023-06-07 12:15:39,713 - root - INFO - [webservice.py:471] Proposal 003493 run 22 not migrated yet. Will try again (12/300)
2023-06-07 12:15:29,624 - root - INFO - [webservice.py:471] Proposal 003493 run 22 not migrated yet. Will try again (11/300)
2023-06-07 12:15:19,519 - root - INFO - [webservice.py:471] Proposal 003493 run 22 not migrated yet. Will try again (10/300)
2023-06-07 12:15:09,416 - root - INFO - [webservice.py:471] Proposal 003493 run 22 not migrated yet. Will try again (9/300)
2023-06-07 12:14:59,323 - root - INFO - [webservice.py:471] Proposal 003493 run 22 not migrated yet. Will try again (8/300)
2023-06-07 12:14:49,225 - root - INFO - [webservice.py:471] Proposal 003493 run 22 not migrated yet. Will try again (7/300)
2023-06-07 12:14:39,121 - root - INFO - [webservice.py:471] Proposal 003493 run 22 not migrated yet. Will try again (6/300)
2023-06-07 12:14:29,015 - root - INFO - [webservice.py:471] Proposal 003493 run 22 not migrated yet. Will try again (5/300)
2023-06-07 12:14:18,917 - root - INFO - [webservice.py:471] Proposal 003493 run 22 not migrated yet. Will try again (4/300)
2023-06-07 12:14:08,814 - root - INFO - [webservice.py:471] Proposal 003493 run 22 not migrated yet. Will try again (3/300)
2023-06-07 12:13:58,723 - root - INFO - [webservice.py:471] Proposal 003493 run 22 not migrated yet. Will try again (2/300)
2023-06-07 12:13:48,623 - root - INFO - [webservice.py:471] Proposal 003493 run 22 not migrated yet. Will try again (1/300)
2023-06-07 12:13:38,506 - root - INFO - [webservice.py:471] Proposal 003493 run 22 not migrated yet. Will try again (0/300)
```
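With shell access to the deployment, the same pattern can be spotted by grepping the webservice log (a sketch; the log file path below is a placeholder, not the actual deployment path):
```bash
# Placeholder log path; the real file lives inside the deployed pycalibration instance.
grep "not migrated yet" /path/to/webservice.log | tail -n 20
```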
### Allocated jobs are in pending state
![pending correction jobs](../static/troubleshooting/pending_jobs.png)
There are two partitions used by the offline calibration for `ACTIVE` proposals. In these partitions `xcal` has a high priority, hence if all nodes are occupied by other users, `xcal` is able to take over a node.
This helps avoid the issue of not finding resources during user experiments to run dark processing or corrections. However, in some cases you may get a call about calibration jobs staying `PENDING` for too long; a quick `squeue` check is sketched after this list.
1. One reason would be that `upex-middle` (or `upex-high` for darks), which is used for offline correction, has all resources occupied by other calibration jobs. This can happen if multiple run corrections were requested for an `ACTIVE` proposal, delaying runs from another or the same instrument. If another instrument is affected, the DOC will need to coordinate with both instruments; one solution would be to stop the corrections occupying all resources if they are not urgent compared to the corrections for the other instrument.
2. Another reason can be that neither `upex-middle` nor `upex-high` is used for the triggered calibrations. This can happen when the runs to be calibrated don't belong to an `ACTIVE` proposal, either because the proposal has finished its `ACTIVE` time window or because data was acquired before the proposal was expected to become `ACTIVE`. Check [calibration partitions](webservice.md#upex-calibration-partitions) for more details.
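A quick Slurm query (standard `squeue` options; the `xcal` user and partitions are the ones named above) shows which partition the stuck jobs were submitted to and the reason Slurm keeps them pending:
```bash
# List pending calibration jobs with their partition and Slurm's pending reason.
squeue -u xcal -t PENDING -o "%.12i %.12P %.30j %.10T %r"
```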
## Correction
### Correction is taking longer than expected (I/O delay)
If the correction was properly started from myMDC and the related jobs are not pending for a long time, but rather are processing for longer than expected, e.g. compared to other runs previously corrected in the same proposal, one reason could be that the data for this run was moved from fast-access GPFS to dCache. This movement is expected for finished proposals after a certain time window, to leave space for new data and active proposals.
Data in dCache has a longer I/O time, and for runs with many sequence files the processing can be noticeably affected.
To check if the data is on dCache, myMDC can be used.
![mymdc_repositories](../static/myMDC/repositories.png)
This image shows that the runs are on both GPFS and dCache; other proposals may have some or all runs only on dCache.
### Correction failed: no constants found
For most detectors, if the dark offset constant was not retrieved, the correction will not go through.
<!-- ## Dark processing -->
@@ -13,7 +13,7 @@ Beside forming and executing the CL for the corresponding detector and calibrati
The webservice uses an SQLite database to store and keep track of the requests, executions, and calibration Slurm jobs.
![job database](../static/webservice/webservice_job_db.png)
As can be seen, there are three tables: Executions, Slurm jobs, and Requests.
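With access to the deployment, the table layout can be inspected directly with the `sqlite3` CLI (a sketch; the database file path below is a placeholder, only the three table names come from this page):
```bash
# Placeholder path; the real DB file lives inside the deployed webservice instance.
sqlite3 /path/to/webservice_jobs.sqlite '.tables'
sqlite3 /path/to/webservice_jobs.sqlite '.schema'
```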
@@ -25,8 +25,36 @@ Users can generate calibration constants using [myMDC](myMDC.md#calibration-cons
Users can trigger offline calibration through [myMDC](myMDC.md#triggering-offline-correction). The webservice handles this remote request via ZMQ and starts by registering the request in the [Job DB](#job-database). The next step is reading the configuration for the correction's xfel-calibrate CLI and launching the correction after confirming that the RAW data has been migrated; otherwise the correction waits until the transfer is complete. By default, corrections are disabled and skipped for dark run types. The webservice replies to [myMDC](myMDC.md) with a success message that the correction was launched, or with an error message if it was not.
## upex calibration partitions
As the webservice is the main entry point for all calibration requests during operation and user experiments, it is essential to manage how resources are allocated to different proposals and calibration actions.
Previously, several options were tried to provide urgent proposals with available nodes and to avoid delaying crucial dark processing for an experiment because other correction jobs and users were occupying the available Maxwell nodes. Node reservations were used, as well as scripts from ITDM's side to increase the Slurm priority of jobs running under the `xcal` user.
The current solution, in place for some time, is two `upex` partitions in which the `xcal` user has priority and can take over nodes from other users if no other node is available in the partition.
The two partitions are `upex-middle` and `upex-high`, used for correction and dark calibrations respectively. The idea is to launch Slurm jobs on one of these partitions when the calibrated run belongs to an `ACTIVE` proposal.
Below is a screenshot showing proposal 5438 when its `Beamline status` was `Active`.
![Active](../static/webservice/mymdc_active_proposal_5438.png)
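The current state of these partitions can be checked from Maxwell with standard Slurm tools (a sketch; only the partition names and the `xcal` user are taken from this page):
```bash
# Node availability and state in the two calibration partitions.
sinfo -p upex-middle,upex-high -o "%.12P %.6a %.6D %.12T"
# Calibration jobs currently queued or running there.
squeue -p upex-middle,upex-high -u xcal
```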
## job monitor
The Job DB is regularly monitored by a separate, dedicated service. Below is a screenshot of its logs from the [webservice overview page](#webservice-overview).
![webservice job monitor](../static/webservice/job_monitor_log.png)
## Webservice Overview webpage
This is the main webservice webpage `max-exfl-cal001.desy.de:8008`. Through it one can have an overview of the latest webservice activity:
- Running calibration jobs and access to them.
- The latest dark processing reports. It is possible to access all dark processing reports for all detectors.
- The latest correction reports.
- The main webservice logs.
- The [job monitor](#job-monitor) logs.
This webpage can only be accessed from Maxwell, and it is a separate service that runs alongside the webservice.
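From a Maxwell node, reachability of the page can be verified with plain `curl` (the host and port are the ones given above):
```bash
# Should return the overview page HTML if the service is up.
curl -s http://max-exfl-cal001.desy.de:8008 | head
```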