Configuration
The European XFEL Offline Calibration tools are configured using the settings.py and notebooks.py files, both of which can be found in the root directory. The settings.py file adapts the tools to the environment they run in; the notebooks.py file configures which notebooks are exposed on the command line.
Settings
The settings.py file configures the environment the tools are run in. It is a normal Python file of the form:
import os

# path into which temporary files from each run are placed
temp_path = "{}/temp/".format(os.getcwd())
# path to use for calling Python. If the environment is set up correctly,
# this can simply be the command
python_path = "python"
# path to store reports in
report_path = "{}/calibration_reports/".format(os.getcwd())
# also try to output the report to an out_folder defined by the notebook
try_report_to_output = True
# the command to run jobs concurrently. It is prepended to the actual call
launcher_command = "sbatch -p exfel -t 24:00:00 --mem 500G --mail-type END --requeue --output {temp_path}/slurm-%j.out"
The meaning of each configuration parameter is given in the comment above it.
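The launcher_command contains a {temp_path} placeholder. As a minimal sketch of how such a template might be expanded before a job is submitted (the script name and the exact assembly logic here are hypothetical, not the tool's actual code):

```python
# Hypothetical sketch: fill the {temp_path} placeholder with str.format
# and append the notebook-execution command. The real assembly is done
# inside xfel-calibrate itself.
temp_path = "/tmp/xfel_calibrate/temp"
launcher_command = ("sbatch -p exfel -t 24:00:00 --mem 500G "
                    "--mail-type END --requeue "
                    "--output {temp_path}/slurm-%j.out")

# str.format only touches {...} placeholders, so the slurm %j token
# passes through untouched.
cmd = launcher_command.format(temp_path=temp_path).split()
cmd += ["./run_notebook.sh"]  # hypothetical script run per job
print(cmd[0])   # sbatch
```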
Notebooks
The xfel-calibrate tool will expose any notebooks that are configured here to the command line by automatically parsing the parameters given in the notebook's first cell. The configuration is given in the form of a Python dictionary:
notebooks = {
    "AGIPD": {
        "DARK": {
            "notebook": "AGIPD/Characterize_AGIPD_Gain_Darks_NBC.ipynb",
            "concurrency": {"parameter": "modules",
                            "default concurrency": 16,
                            "cluster cores": 16},
        },
        "PC": {
            "notebook": "AGIPD/Chracterize_AGIPD_Gain_PC_NBC.ipynb",
            "concurrency": {"parameter": "modules",
                            "default concurrency": 16,
                            "cluster cores": 16},
        },
        "CORRECT": {
            "notebook": "notebooks/AGIPD/AGIPD_Correct_and_Verify.ipynb",
            "concurrency": {"parameter": "sequences",
                            "use function": "balance_sequences",
                            "default concurrency": [-1],
                            "cluster cores": 32},
        },
        ...
    }
}
The first key is the detector that the calibration may be used for, here AGIPD. The second key level gives the name of the task being performed (here: DARK and PC). For each of these entries, a path to the notebook and a concurrency hint should be given. In the concurrency hint, the parameter entry specifies which notebook parameter expects a list whose entries can be run concurrently (here "modules"). The default concurrency entry states the range with which to fill this parameter if it is not given by the user. In the example, a range(16) := 0, 1, 2, ..., 15 would be passed to the notebook, which is then run as 16 concurrent jobs, each processing one module. Finally, cluster cores gives a hint for the number of cluster cores to request. This value should be derived e.g. by profiling memory usage per core, run times, etc.
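As a rough illustration (not the actual implementation), expanding an integer default concurrency into per-job parameter values could look like this:

```python
# Sketch of how an integer "default concurrency" hint could be turned
# into one job per value. All names here are illustrative.
concurrency = {"parameter": "modules",
               "default concurrency": 16,
               "cluster cores": 16}

user_value = None  # the user did not pass --modules on the command line

# An integer default is interpreted as range(n): one job per entry.
n = concurrency["default concurrency"]
job_values = user_value if user_value is not None else list(range(n))

# One cluster job is launched per entry, each seeing a single module.
jobs = [{concurrency["parameter"]: [v]} for v in job_values]
print(len(jobs))  # 16
print(jobs[0])    # {'modules': [0]}
```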
Note
It is good practice to name command-line-enabled notebooks with an _NBC suffix, as shown in the above example.
The CORRECT notebook (the last notebook in the example) makes use of a concurrency-generating function by setting the use function parameter. This function must be defined in a code cell in the notebook, and its parameters should be named like other exposed parameters. It should return a list of parameter values to be inserted into the concurrently run notebooks. The example defines the balance_sequences function:
def balance_sequences(in_folder, run, sequences, sequences_per_node):
    import glob
    import re
    import numpy as np

    if sequences_per_node != 0:
        sequence_files = glob.glob("{}/r{:04d}/*-S*.h5".format(in_folder, run))
        seq_nums = set()
        for sf in sequence_files:
            seqnum = re.findall(r".*-S([0-9]*).h5", sf)[0]
            seq_nums.add(int(seqnum))
        seq_nums -= set(sequences)
        return [l.tolist() for l in np.array_split(list(seq_nums),
                                                   len(seq_nums)//sequences_per_node + 1)]
    else:
        return sequences
Note
Note how imports are inlined in the definition. This is necessary, as only the function code, not the entire notebook, is executed.
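To see the splitting step in isolation, here is a standalone sketch of the np.array_split call used above, with made-up sequence numbers:

```python
import numpy as np

# Standalone illustration of the chunking in balance_sequences:
# 10 sequence numbers with at most 2 per node yield 6 chunks, with
# np.array_split producing smaller trailing chunks as needed.
seq_nums = list(range(10))
sequences_per_node = 2
chunks = [c.tolist()
          for c in np.array_split(seq_nums,
                                  len(seq_nums)//sequences_per_node + 1)]
print(chunks)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8], [9]]
```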
This in turn requires exposed parameters such as:
in_folder = "/gpfs/exfel/exp/SPB/201701/p002038/raw/" # the folder to read data from, required
run = 239 # runs to process, required
sequences = [-1] # sequences to correct, set to -1 for all, range allowed
sequences_per_node = 2 # number of sequence files per cluster node if run as slurm job, set to 0 to not run SLURM parallel
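A much-simplified sketch of how the assignments in such a first cell could be collected is shown below. The real parser in xfel-calibrate is more elaborate (whether it also uses the trailing comments as help text is an assumption here, not confirmed by this document):

```python
import ast

# Simplified, illustrative parser for a notebook's first cell:
# collect top-level "name = literal" assignments into a dict.
first_cell = '''
in_folder = "/gpfs/exfel/exp/SPB/201701/p002038/raw/"  # the folder to read data from, required
run = 239  # runs to process, required
sequences = [-1]  # sequences to correct, set to -1 for all
'''

params = {}
for node in ast.parse(first_cell).body:
    if isinstance(node, ast.Assign) and isinstance(node.targets[0], ast.Name):
        # literal_eval safely evaluates constants, lists, negative numbers, ...
        params[node.targets[0].id] = ast.literal_eval(node.value)

print(params["run"])   # 239
print(sorted(params))  # ['in_folder', 'run', 'sequences']
```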
Note
The function only needs to be defined, but not executed within the notebook context itself.