More documentation

Commit 090f28ee authored 6 years ago by Steffen Hauf
--- a/LPD/Investigate_Non_Linear_Transition.ipynb
+++ b/LPD/Investigate_Non_Linear_Transition.ipynb
--- a/docs/source/workflow.rst
+++ b/docs/source/workflow.rst
 Development Workflow
 ====================
+
+The following walkthrough will guide you through a possible workflow
+when developing new offline calibration tools.
+
+Fresh Start
+-----------
+
+If you are starting a new notebook from scratch, first consider the
+following points:
+
+* Will the notebook perform a headless task, or will it also be
+  an important interface for evaluating the results in the form of a
+  report?
+* Do you need to run concurrently? Is concurrency handled internally,
+  e.g. by use of ipcluster, or also on a host level, using cluster
+  computing via slurm?
+
+In case you plan on using the notebook as a report tool, you should make
+sure to provide sufficient guidance and textual details using e.g. markdown
+cells in the notebook. You should also structure it into appropriate 
+subsections.
+
+If you plan on running concurrently on the cluster, identify which variable
+should be mapped to concurrent runs. For this variable to be filled in
+automatically, it needs to be a list of integers.
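+
+As a rough illustration of why an integer list is convenient, the sketch below
+splits such a list into one chunk per concurrent job. The helper name
+``split_for_jobs`` and the chunking strategy are assumptions made for this
+example; they are not taken from the tool chain itself.

```python
def split_for_jobs(values, n_jobs):
    """Split a list of integers into at most n_jobs contiguous chunks.

    Hypothetical helper for illustration only.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    # never create more chunks than there are values (but at least one)
    n_jobs = min(n_jobs, len(values)) or 1
    size, extra = divmod(len(values), n_jobs)
    chunks, start = [], 0
    for i in range(n_jobs):
        # the first `extra` chunks receive one additional element
        stop = start + size + (1 if i < extra else 0)
        chunks.append(values[start:stop])
        start = stop
    return chunks


modules = list(range(16))
for job_id, chunk in enumerate(split_for_jobs(modules, 4)):
    print(job_id, chunk)
```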
+
+Once you've clarified the above points, create a new notebook, either in an
+existing detector folder or, for a detector that is not yet integrated, in a
+new folder with the detector's name. Give it a suffix `_NBC` to denote that
+it is enabled for the tool chain.
+
+You should then start writing your code following the guidelines 
+below.
+
+
+From Existing Notebook
+----------------------
+
+Copy your existing notebook into the appropriate detector directory,
+or create a new directory if the detector does not have one yet. Give the
+copy a suffix `_NBC` to denote that it is enabled for the tool chain.
+
+You should then start restructuring your code following the guidelines 
+below.
+
+Title and Author Information
+----------------------------
+
+Especially for report generation, the notebook should have a proper title,
+author and version. These should be given in a leading markdown cell in
+the form::
+
+    # My Fancy Calculation #
+    
+    Author: Jane Doe, Version 0.1
+    
+    A description of the notebook.
+    
+Information in this format allows automatic parsing of author and version.
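+
+To make the expected header format concrete, here is a minimal sketch of how
+such a cell could be parsed. The function ``parse_header`` and both regular
+expressions are assumptions for illustration; the parser actually used by the
+tool chain may differ.

```python
import re

# "# My Fancy Calculation #" -> title (trailing '#' characters are optional)
TITLE_RE = re.compile(r"^#\s*(?P<title>.*?)\s*#*\s*$")
# "Author: Jane Doe, Version 0.1" -> author and version
AUTHOR_RE = re.compile(
    r"^Author:\s*(?P<author>.+?),\s*Version\s+(?P<version>\S+)\s*$"
)


def parse_header(cell_source):
    """Extract title, author and version from a markdown cell (hypothetical)."""
    info = {}
    for line in cell_source.splitlines():
        line = line.strip()
        title_match = TITLE_RE.match(line)
        if title_match and "title" not in info:
            info["title"] = title_match.group("title")
            continue
        author_match = AUTHOR_RE.match(line)
        if author_match:
            info.update(author_match.groupdict())
    return info


cell = """# My Fancy Calculation #

Author: Jane Doe, Version 0.1

A description of the notebook.
"""
print(parse_header(cell))
```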
+
+
+Exposing Parameters to the Command Line
+---------------------------------------
+
+The European XFEL Offline Calibration toolkit automatically deduces
+command line arguments for Jupyter notebooks. It does this with an
+extended version of nbparameterise_, originally written by Thomas
+Kluyver.
+
+Parameter deduction tries to parse all variables defined in the first
+code cell of a notebook. The following variable types are supported:
+
+* numbers: ints and floats
+* Booleans
+* strings
+* lists of any of the above
+
+You should avoid having `import` statements in this cell. Line comments
+can be used to define the help text provided by the command line interface,
+and to signify whether lists can be constructed from ranges and whether
+parameters are required::
+
+    in_folder = '/gpfs/exfel/exp/SPB/201830/p900019/raw' # path to input data, required
+    modules = [0] # modules to work on, required, range allowed
+    out_folder = "/gpfs/exfel/exp/SPB/201830/p900019/proc/calibration0618/FF" # path to output to, required
+    runs = [820,] # runs to use, required, range allowed
+    sequences = [0,1,2,3,4] # sequences files to use, range allowed
+    cluster_profile = "noDB" # The ipcluster profile to use
+    local_output = True # output constants locally
+    
+Here, `in_folder` and `out_folder` are required string values. `modules` is a list, which
+from the command line could also be assigned using a range expression, e.g. `5-10,12,13,18-21`,
+which would translate to `5,6,7,8,9,12,13,18,19,20`; the upper bound of each range is excluded.
+It is also a required parameter. The parameter `local_output` is a Boolean.
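+
+The range expansion described above can be sketched as follows. The function
+``parse_range_expression`` is a hypothetical helper written for this example,
+not the toolkit's own parser; it assumes non-negative integers and, following
+the example in the text, treats the upper bound of each range as excluded.

```python
def parse_range_expression(expr):
    """Expand e.g. "5-10,12,13,18-21" into a list of integers.

    Hypothetical helper: upper bounds are exclusive, as in Python's range().
    """
    values = []
    for part in expr.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate trailing commas
        if "-" in part:
            start, stop = (int(p) for p in part.split("-", 1))
            values.extend(range(start, stop))
        else:
            values.append(int(part))
    return values


print(parse_range_expression("5-10,12,13,18-21"))
# [5, 6, 7, 8, 9, 12, 13, 18, 19, 20]
```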
+
+The `cluster_profile` parameter is a bit special, in that the toolkit expects exactly this
+name to provide the profile name for an ipcluster_ being run. Hence, if you use ipcluster_
+for parallelisation, define your profile name in this variable.
+
+The excerpt above is from a flat field characterization notebook for AGIPD. The code would lead
+to the following parameters being exposed via the command line::
+
+    % python calibrate_nbc.py AGIPD FF --help
+    usage: calibrate_nbc.py [-h] --in-folder str [--modules str [str ...]]
+                            --out-folder str --runs str [str ...]
+                            [--sequences str [str ...]] [--cluster-profile str]
+                            [--local-output] [--db-output] [--bias-voltage int]
+                            [--cal-db-interface str] [--mem-cells int]
+                            [--interlaced] [--fit-hook] [--rawversion int]
+                            [--instrument str] [--photon-energy float]
+                            [--offset-store str] [--high-res-badpix-3d]
+                            [--db-input] [--deviation-threshold float]
+                            DETECTOR TYPE
+
+    Main entry point for offline calibration
+
+    positional arguments:
+      DETECTOR              The detector to calibrate
+      TYPE                  Type of calibration: LPD,AGIPD
+
+    optional arguments:
+      -h, --help            show this help message and exit
+      --in-folder str       path to input data, required. Default: None
+      --modules str [str ...]
+                            modules to work on, required, range allowed. Default:
+                            None
+      --out-folder str      path to output to, required. Default: None
+      --runs str [str ...]  runs to use, required, range allowed. Default: None
+      --sequences str [str ...]
+                            sequences files to use, range allowed. Default: [0, 1,
+                            2, 3, 4]
+      --cluster-profile str
+                            The ipcluster profile to use. Default: noDB
+      --local-output        output constants locally. Default: True
+
+    ...
+    
+
+.. note::
+
+    Nbparameterise can only parse the mentioned subset of variable types. An expression
+    that evaluates to such a type will not be recognized: e.g. `a = list(range(3))` will
+    not work!
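+
+The difference between a literal and an expression can be checked with the
+standard library. The sketch below uses ``ast.literal_eval`` as a stand-in;
+this is an assumption for illustration and not how nbparameterise is
+implemented. Note that ``literal_eval`` also accepts tuples, dicts and
+``None``, which is a wider set than the supported types listed above.

```python
import ast


def is_plain_literal(source):
    """Return True if `source` is a Python literal rather than an expression."""
    try:
        ast.literal_eval(source)
    except (ValueError, SyntaxError):
        return False
    return True


print(is_plain_literal("[0, 1, 2]"))       # True
print(is_plain_literal("list(range(3))"))  # False
```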
+
+The following table lists suggested names for common parameters, which helps
+to stay consistent across all notebooks.
+
+
+.. table:: Suggested naming of parameters
+
+    ================ =============================================================== ==========================
+    Parameter name   To be used for                                                  Special purpose
+    ================ =============================================================== ==========================
+    in_folder        the input path data resides in, usually without a run number
+    out_folder       path to write data out to, usually without a run number         reports can be placed here
+    run(s)           which XFEL DAQ runs to use, often ranges are allowed
+    modules          refers to the modules of a segmented detector, ranges often ok.
+    sequences        sequence files for the XFEL DAQ system, ranges are often ok.
+    cluster_profile  name of the cluster profile for ipcluster                       fixed name
+    local_input      read calibration constant from file, not database
+    local_output     write calibration constant to file, not database
+    db_input         read calibration constant from database, not file
+    db_output        write calibration constant to database, not file
+    cal_db_interface the calibration database host in form of "tcp://host:port"
+    ================ =============================================================== ==========================
+
+
+
+Best Coding Practices
+---------------------
+
+In principle there are no restrictions other than that parameters that are exposed to the
+command line need to be defined in the first code cell of the notebook.
+
+However, a few guidelines should be observed to make notebooks useful for display as
+reports and for usage by others.
+
+External Libraries
+~~~~~~~~~~~~~~~~~~
+
+You may use a wide variety of libraries available in Python, but keep in mind that others
+wanting to run the tool will need to install these requirements as well. Thus,
+
+* do not use a specialized tool if an accepted alternative exists. Plots, for example, should
+  usually be created using matplotlib_ and numerical processing should be done in numpy_.
+
+* keep runtimes and library requirements in mind. A library doing its own parallelism either
+  needs to be able to set this up programmatically, or do so automatically. If you need to
+  start something from the command line first, things might be tricky as you will likely
+  need to run this via `Popen` commands with appropriate environment variables.
+  
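+Starting an external command with adjusted environment variables could look
+roughly like the sketch below. The helper ``run_with_env`` and the variable
+name ``OMP_NUM_THREADS`` are assumptions chosen for this example; substitute
+whatever your library actually requires.

```python
import os
import subprocess
import sys


def run_with_env(cmd, extra_env):
    """Run `cmd` with additional environment variables; return (code, stdout)."""
    env = dict(os.environ)   # start from the current environment
    env.update(extra_env)    # add or override selected variables
    proc = subprocess.Popen(cmd, env=env, stdout=subprocess.PIPE, text=True)
    out, _ = proc.communicate()
    return proc.returncode, out


# the child process simply echoes the variable it received
code, out = run_with_env(
    [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"],
    {"OMP_NUM_THREADS": "4"},
)
print(code, out.strip())
```
+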
+Writing out data
+~~~~~~~~~~~~~~~~
+
+If your notebook produces output data, consider writing data out as early as possible,
+such that it is available as soon as possible. Detailed plotting and inspection can
+be done later on in the notebook.
+
+Also consider using HDF5 via h5py_ as your output format. If you correct or calibrate
+input data which adheres to the XFEL naming convention, you should maintain the convention
+in your output data. You should not touch any data that you do not actively work on, and
+should ensure that the `INDEX` and identifier entries are synchronized with respect to
+your output data. E.g. if you remove pulses from a train, the `INDEX/.../count` section
+should reflect this.
+
+Finally, XFEL RAW data can contain filler data from the DAQ. One possible way of identifying
+this data is the following::
+
+    import numpy as np
+
+    def read_valid_cell_ids(infile, channel):
+        """Return the cellIds of non-filler data from an open h5py.File."""
+        source = "FXE_DET_LPD1M-1/DET/{}CH0:xtdf/image".format(channel)
+        datapath = "/INSTRUMENT/{}/cellId".format(source)
+
+        # per-train entry counts and first indices are read from the INDEX
+        # section (path assumed to mirror the INSTRUMENT source path)
+        count = np.squeeze(infile["/INDEX/{}/count".format(source)])
+        first = np.squeeze(infile["/INDEX/{}/first".format(source)])
+        if np.count_nonzero(count != 0) == 0:  # filler data has counts of 0
+            print("File {} has no valid counts".format(infile.filename))
+            return None
+        valid = count != 0
+        idxtrains = np.squeeze(infile["/INDEX/trainId"])
+        median_train = np.nanmedian(idxtrains)  # protect against freak train ids
+        valid &= (idxtrains > median_train - 1e4) & (idxtrains < median_train + 1e4)
+
+        # index range in which non-filler data exists
+        first_index = int(first[valid][0])
+        last_index = int(first[valid][-1] + count[valid][-1])
+
+        # access these indices
+        return np.squeeze(np.array(infile[datapath][first_index:last_index, ...]))
+    
+
+Plotting
+~~~~~~~~
+
+When creating plots, make sure that the plot is either self-explanatory or add markdown
+comments with adequate description. Do not add "free-floating" plots, always put them into
+a context. Make sure to label your axes.
+
+Also make sure the plots are readable on an A4-sized PDF page; this is the format the notebook
+will be rendered to for report outputs. Specifically, this means that figure sizes should not
+exceed approx. 15x15 inches.
+
+The report will contain 150 dpi PNG images of your plots. If you need higher quality output
+of individual plot files, you should save these separately yourself, e.g. via `fig.savefig(...)`.
+
+
+.. _nbparameterise: https://github.com/takluyver/nbparameterise
+.. _ipcluster: https://ipyparallel.readthedocs.io/en/latest/
+.. _matplotlib: https://matplotlib.org/
+.. _numpy: http://www.numpy.org/
+.. _h5py: https://www.h5py.org/
\ No newline at end of file