# Development Workflow
We welcome contributions of calibration notebooks or algorithms that you believe could be useful to the pipeline. To facilitate development, this section outlines the key points to consider when creating new features. It is designed to guide you through the development and review process and to ensure that your contributions are consistent with the pipeline's requirements. If you have any questions or concerns about the development process, please do not hesitate to reach out to us for assistance. We look forward to working with you to enhance the pipeline's capabilities.
## Developing a notebook from scratch
Developing a notebook from scratch can be a challenging but rewarding process. Here are some key steps to consider:
- Define the purpose
Start by identifying the problem you are trying to solve and the task you want to perform with your notebook.
- Does the user need to execute the notebook interactively?
- Should it run the same way as the production notebooks? It is recommended that the notebook be executed in the same way as the production notebooks, through the `xfel-calibrate` CLI.
??? Note "`xfel-calibrate` CLI is essential"
    If the `xfel-calibrate` CLI is essential, you need to follow the guidelines on where and how to define the variables in the first notebook cell, and on how to include the notebook as one of the CLI calibration options to execute.
- Does the notebook need to generate a report at the end to display its results or can it run without any user interaction?
??? Note "A report is needed"
    If a report is needed, you should make sure to provide sufficient guidance and textual details using markdown cells and clear prints within the code. You should also structure the notebook cells into appropriate subsections.
- Plan your workflow
Map out the steps your notebook will take, from data ingestion to analysis of results and visualization.
- What are the required data sources that the notebook needs to access or utilize? For example, GPFS or calibration database.
- Can the notebook's internal concurrency be optimized through the use of multiprocessing or is it necessary to employ host-level cluster computing with SLURM to achieve higher performance?
??? Note "SLURM concurrency is needed"
    If SLURM concurrency is needed, you need to identify the variable over which the notebook will be replicated to split the processing.
- What visualization tools or techniques are necessary to provide an overview of the processing results generated by the notebook? Can you give examples of charts, graphs, or other visual aids that would be useful for understanding the output?
- Write the code and include documentation
Begin coding your notebook based on your workflow plan. Use comments to explain code blocks and decisions.
- [PEP 8](https://peps.python.org/pep-0008/) code style is highly recommended. It leads to code that is easier to read, understand, and maintain. Additionally, it is a widely accepted standard in the Python community, and following it makes your code more accessible to other developers and improves collaboration.
- [Google style docstrings](https://google.github.io/styleguide/pyguide.html) are our recommended way of documenting the code. By providing clear and concise descriptions of your functions and methods, including input and output parameters, potential exceptions, and other important details, you make it easier for other developers to understand the code and allow the mkdocs documentation to [reference it](SUMMARY.md). A short docstring sketch is shown after this list.
- Document the notebook and split into sections.
Enriching a notebook with documentation is an important step in creating a clear and easy-to-follow guide for others to use:
- Use Markdown cells to create titles and section headings: clear and descriptive headings for each section make it easier to navigate and understand the content of the notebook, but more importantly they are parsed while creating the PDF report using [sphinx][sphinx].
- Add detailed explanations to each section.
- Add comments to your code.
- Test and refine
Test your notebook thoroughly to identify any issues. Refine your code and documentation as needed to ensure your notebook is accurate, efficient, and easy to use.
- Share and collaborate
Share your notebook on [GitLab](https://git.xfel.eu/) to start seeking feedback and begin the reviewing process.
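As a concrete illustration of the recommended Google-style docstrings, here is a minimal sketch; the function, its parameters, and its behaviour are hypothetical and chosen only for the example:

```python
def subtract_offset(data, offset, clamp_negative=True):
    """Subtract a per-pixel offset from raw detector data.

    Args:
        data (numpy.ndarray): Raw detector data of shape (frames, x, y).
        offset (numpy.ndarray): Per-pixel offset constant of shape (x, y).
        clamp_negative (bool): If True, clip negative corrected values to zero.

    Returns:
        numpy.ndarray: Offset-corrected data with the same shape as `data`.

    Raises:
        ValueError: If the shapes of `data` and `offset` are incompatible.
    """
    if data.shape[-2:] != offset.shape:
        raise ValueError("data and offset shapes are incompatible")
    corrected = data - offset
    if clamp_negative:
        corrected = corrected.clip(min=0)
    return corrected
```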
## Write notebook to execute using xfel-calibrate
To start developing a new notebook, you either create it in an existing detector directory or create a new directory for it with the new detector's name. Give it a suffix `_NBC` to denote that it is enabled for the tool chain.
You should then start writing your code following these guidelines:
- The first markdown cell is for the title, author, and notebook description. This is automatically parsed in the report.
- The first code cell must have all parameters that will be exposed to the `xfel-calibrate` CLI.
- The second code cell is for importing all needed libraries and methods.
- The following code and markdown cells are for data ingestion, data processing, and data visualization. Markdown cells are very important as they will be parsed as the main source of report text and documentation after the calibration notebook is executed.
### Exposing parameters to xfel-calibrate

The European XFEL Offline Calibration toolkit automatically deduces command line arguments from Jupyter notebooks. It does this with an extended version of nbparameterise, originally written by Thomas Kluyver.
Parameter deduction tries to parse all variables defined in the first code cell of a notebook. The following variable types are supported:
- Numbers (int or float)
- Booleans
- Strings
- Lists of the above
You should avoid having `import` statements in this cell. Line comments can be used to define the help text provided by the command line interface, to signify whether lists can be constructed from ranges, and whether parameters are required:
```python
in_folder = ""  # directory to read data from, required
out_folder = ""  # directory to output to, required
metadata_folder = ""  # directory containing calibration metadata file when run by xfel-calibrate
run = [820, ]  # runs to use, required, range allowed
sequences = [0, 1, 2, 3, 4]  # sequences files to use, range allowed
modules = [0]  # modules to work on, required, range allowed
karabo_id = "MID_DET_AGIPD1M-1"  # Detector karabo_id name
karabo_da = [""]  # a list of data aggregators names, Default [-1] for selecting all data aggregators
skip_plots = False  # exit after writing corrected files and metadata
```
The above are some example parameters from the AGIPD correction notebook.
- Here, `in_folder` and `out_folder` are set as required string values.
  Values for required parameters have to be given when executing from the command line. This means that any defaults given in the first cell of the code are ignored (they are only used to derive the type of the parameter).
- `modules` and `sequences` are lists of integers, which from the command line could also be assigned using a range expression, e.g. `5-10,12,13,18-21`, which would translate to `5,6,7,8,9,12,13,18,19,20`.
!!! Warning
    nbparameterise can only parse the mentioned subset of variable types. An expression that evaluates to such a type will not be recognized, e.g. `a = list(range(3))` will not work! A short illustration follows this list.
- `karabo_id` is a string value indicating the detector to be processed.
- `karabo_da` is a list of strings indicating the detector's modules to be processed. `karabo_da` and `modules` are two different variables pointing to the same physical parameter; both are synced in the later notebook cells before usage.
- `skip_plots` is a boolean for skipping the notebook plots to save time and deliver the report as soon as the data are processed. To set `skip_plots` to False from the command line, `--no-skip-plots` is used.
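To make the parsing rules above concrete, the following sketch contrasts first-cell assignments that nbparameterise picks up with ones it ignores; all parameter names here are hypothetical:

```python
# Recognised by nbparameterise: literal values of the supported types.
runs = [9013, 9014]       # list of ints; can also be given as a range on the CLI
threshold = 0.5           # float
detector_name = "AGIPD"   # string
skip_plots = False        # boolean

# Not recognised: anything that has to be evaluated first.
# runs = list(range(9013, 9015))   # expression, ignored by the parser
# import numpy as np               # imports do not belong in this cell
```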
The table below provides a set of recommended parameter names to ensure consistency across all notebooks.
| Parameter name    | To be used for                                                          | Special purpose                                                                          |
|-------------------|-------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
| `in_folder`       | the input path the data resides in, usually without a run number.      |                                                                                          |
| `out_folder`      | path to write data out to, usually without a run number.               | reports can be placed here                                                               |
| `metadata_folder` | directory path for the calibration metadata file with local constants. |                                                                                          |
| `run(s)`          | which XFEL DAQ runs to use, often ranges are allowed.                   |                                                                                          |
| `karabo_id`       | detector karabo name to access detector files and constants.           |                                                                                          |
| `karabo_da`       | the detector's modules data aggregator names to process.               |                                                                                          |
| `modules`         | the detector's module indices to process, ranges often ok.             |                                                                                          |
| `sequences`       | sequence files for the XFEL DAQ system, ranges are often ok.           |                                                                                          |
| `local_output`    | write calibration constants to file, not the database.                 |                                                                                          |
| `db_output`       | write calibration constants to the database, not to file.              | saves the database from unintentional constant injections during development or testing |
### External Libraries
You may use a wide variety of libraries available in Python, but keep in mind that others wanting to run the tool will need to install these requirements as well. Therefore:
- It is generally advisable to avoid specialized tools or libraries unless there is a compelling reason to use them. Prefer well-established, widely-accepted alternatives that other developers are more likely to be familiar with and that are easier to install and use. For example, use the popular matplotlib library for charts, graphs, and other visualisations, and numpy for numerical processing tasks.
- Keep the runtime and library requirements of your code in mind. In particular, a library that performs its own parallelism needs to be able to set this up programmatically or do so automatically; having to start it from the command line adds extra complications.
- Reading out EXFEL RAW data should be done using extra_data. This tool is designed to facilitate efficient access to data structures stored in HDF5 format. By simplifying access to RAW or CORRECTED datasets, it allows users to quickly select and filter the specific trains, cells, or pixels of interest, greatly reducing the complexity and time required for data analysis of large datasets. A minimal sketch is shown below.
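The following is a minimal sketch of reading RAW data with extra_data; the run path, source name, and key are hypothetical placeholders and need to be adapted to the data at hand:

```python
from extra_data import RunDirectory

# Hypothetical run path and detector source name.
run = RunDirectory("/gpfs/exfel/exp/MID/202201/p002834/raw/r0820")

# Select only the source and key of interest.
sel = run.select("MID_DET_AGIPD1M-1/DET/*CH0:xtdf", "image.data")

# Iterate train by train over the selected data.
for train_id, data in sel.trains(require_all=True):
    # data is a nested dict: {source: {key: array}}
    ...
```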
### Writing out data
If your notebook produces output data, consider writing data out as early as possible, such that it is available as soon as possible. Detailed plotting and inspection can be done later on in the notebook.

Also use HDF5 via h5py as your output format. If you correct or calibrate input data which adheres to the XFEL naming convention, you should maintain the convention in your output data. You should not touch any data that you do not actively work on, and you should assure that the `INDEX` and identifier entries are synchronized with respect to your output data. E.g. if you remove pulses from a train, the `INDEX/.../count` section should reflect this. The `cal_tools.files` module helps you achieve this easily.
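As a rough illustration of the above (not the actual `cal_tools.files` API), the following sketch writes a corrected dataset with plain h5py while keeping `INDEX/.../count` consistent; the file name, source name, and data are hypothetical:

```python
import h5py
import numpy as np

# Dummy data standing in for corrected detector images: 2 trains x 64 pulses each.
corrected = np.zeros((128, 512, 128), dtype=np.float32)
counts_per_train = np.array([64, 64], dtype=np.uint64)

# Hypothetical output file and source names following the XFEL naming convention.
out_file = "CORR-R0820-AGIPD00-S00000.h5"
src = "MID_DET_AGIPD1M-1/DET/0CH0:xtdf"

with h5py.File(out_file, "w") as f:
    # Write the corrected data under the INSTRUMENT section.
    f.create_dataset(f"INSTRUMENT/{src}/image/data", data=corrected,
                     compression="gzip")
    # Keep INDEX/.../count in sync with what was actually written,
    # e.g. if pulses were removed from a train.
    f.create_dataset(f"INDEX/{src}/image/count", data=counts_per_train)
```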
### Plotting
When creating plots, make sure that the plot is either self-explanatory or add markdown comments with adequate description. Do not add "free-floating" plots, always put them into a context. Make sure to label your axes.
Also make sure the plots are readable on an A4-sized PDF page; this is the format the notebook will be rendered to for report outputs. Specifically, this means that figure sizes should not exceed approx 15x15 inches.
The report will contain 150 dpi PNG images of your plots. If you need higher quality output of individual plots you should save these separately, e.g. via `fig.savefig(...)` yourself.
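The sketch below illustrates these recommendations with matplotlib; the data and labels are made up for the example:

```python
import matplotlib.pyplot as plt
import numpy as np

# Keep figures well within an A4 page; the report renders plots at 150 dpi.
fig, ax = plt.subplots(figsize=(8, 5))   # well below the ~15x15 inch limit
ax.plot(np.arange(100), np.random.default_rng(0).normal(size=100))
ax.set_xlabel("Pulse index")             # always label your axes
ax.set_ylabel("Signal [ADU]")
ax.set_title("Example diagnostic plot")

# Optionally save a higher-quality copy of an individual figure yourself.
fig.savefig("example_plot.png", dpi=300)
```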
## xfel-calibrate execution
The package utilizes tools such as nbconvert and nbparameterise to expose Jupyter notebooks to a command line interface. In the process reports are generated from these notebooks.
The general interface is:

```bash
% xfel-calibrate DETECTOR TYPE
```

where `DETECTOR` and `TYPE` specify the task to be performed.
Additionally, it leverages the DESY/XFEL Maxwell cluster to run these jobs in parallel via SLURM.
A list of available notebooks can be found in the Available Notebooks section of this documentation.
## Interaction with the calibration database
During development, it is advised to work with local constant files first, before injecting any calibration constants into the production database. Once the notebook's algorithms and arguments have matured, one can switch over to the test database and then to the production database. The reason for this is to avoid injecting wrong constants that could affect production calibration, and to avoid unnecessary interventions to disable wrong or unused injected calibration constants.
Additionally, the calibration database is only accessible from within the XFEL networks, so developing independently of it improves the workflow.
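One common way to follow this advice, sketched here with the boolean parameters from the table above and purely hypothetical file handling, is to guard any database injection behind `db_output` and keep writing local files via `local_output` during development:

```python
import numpy as np

# Flags as they would appear in the notebook's first cell.
local_output = True    # write calibration constants to local files
db_output = False      # inject calibration constants into the database

# Stand-in for a computed calibration constant.
constant = np.zeros((512, 128), dtype=np.float32)

if local_output:
    np.save("offset_constant_local.npy", constant)   # hypothetical local file
if db_output:
    # Injection into the calibration database would go here; kept disabled
    # until the algorithm and its arguments have matured.
    pass
```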
## Testing
The most important test is that your notebook completes flawlessly outside any special tool chain feature. After all, the tool chain will only replace parameters, launch a concurrent job, and generate a report out of the notebook. If it fails to run in the normal Jupyter notebook environment, it will certainly fail in the tool chain environment.
Once you are satisfied with your current state of initial development, you can add it to the list of notebooks as mentioned in the configuration section.
Any changes you now make in the notebook will be automatically propagated to the command line. Specifically, you should verify that all arguments are parsed correctly, e.g. by calling:

```bash
xfel-calibrate DETECTOR NOTEBOOK_TYPE --help
```
From then on, check whether parallel SLURM jobs are executed correctly and whether a report is generated at the end.
Finally, you should verify that the report contains the information you'd like to convey and is intelligible to people other than you.
???+ note
You can run the `xfel-calibrate` command without starting a [SLURM][slurm] cluster job, giving you direct access to console output, by adding the `--no-cluster-job` option.
## Documenting
Most documentation should be done in the notebook itself. Any notebooks specified in the `notebook.py` file will automatically show up in the Available Notebooks section of this documentation.