Virtual spectrometer
The aim of this project is to read low-resolution photon spectrometer measurements and predict, with an uncertainty band, the result of a high-resolution, invasive photon spectrometer. This allows one to keep collecting low-resolution data without stopping the beam, while nevertheless obtaining high-resolution results.
The concept is to collect both measurements simultaneously during a training phase and use them to train a model that can be applied later, under the same conditions, without the high-resolution, invasive spectrometer. The method is designed to need minimal parameter tuning: if training data is available, no fine-tuning should be required.
Simple usage in Maxwell
The dependencies in the standard xfel (current) kernel are enough to run the example notebook. All one needs to do is copy the main code in this package to a given directory.
This can be done by typing these commands on a Maxwell machine:
git clone https://git.xfel.eu/machineLearning/pes_to_spec.git
cd pes_to_spec
Afterwards, open a Jupyter Lab instance (max-jhub.desy.de), go to the directory where the package has been copied, and open the example notebook in notebook/Example offline analysis.ipynb.
One should be able to run the example analysis in the notebook.
Full installation of dependencies from scratch
The dependencies can be installed in a separate environment and used from Jupyter Lab in Maxwell by executing the following commands on Maxwell:
# clone the repository
git clone https://git.xfel.eu/machineLearning/pes_to_spec.git
cd pes_to_spec
# create separate environment
module load exfel exfel-python
python -m venv env
source env/bin/activate
# install numpy and scipy (Intel-optimized builds first, without their
# dependencies; the second command then fills in anything missing)
pip install --force-reinstall --index-url https://pypi.anaconda.org/intel/simple --no-dependencies numpy scipy==1.7.3
pip install numpy scipy==1.7.3
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install joblib scikit-learn
# install dependencies for the analysis in the notebook
pip install matplotlib lmfit seaborn extra_data
# add it to the Jupyter Lab in Maxwell
pip install ipykernel
python -m ipykernel install --user --name pes_to_spec_env
After this, open the example notebook from the notebook directory in a Jupyter Lab instance (max-jhub.desy.de) and set the kernel to pes_to_spec_env.
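To verify the installation, one may run the following minimal check in a notebook cell using the pes_to_spec_env kernel (a sketch; it only confirms that the package and its dependencies import correctly):
# minimal installation check: should run without errors in the pes_to_spec_env kernel
from pes_to_spec.model import Model
model = Model()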
API usage
The API may be used as follows:
from pes_to_spec.model import Model
from itertools import product
# this is the main object holding all
# information needed for training and prediction
# the default parameters should be sufficient in most cases
# if some channels are dead, you may want to specify only the active channels
channels = [f"channel_{i}_{l}" for i, l in product(range(1, 5), ["A", "B", "C", "D"])]
model = Model(channels=channels)
# this trains the model
# low_resolution_raw_data is a dictionary with keys "channel_[1-4]_[A-D]" and values set to
# 2D numpy arrays of shape (number_of_train_IDs, features),
# holding the low-resolution spectra for each input channel
# high_resolution_intensity and high_resolution_photon_energy are estimates from the
# high-resolution, invasive spectrometer
model.fit(low_resolution_raw_data,
          high_resolution_intensity,
          high_resolution_photon_energy,
          pulse_energy=beam_intensity)
# save it for later usage:
model.save("model.joblib")
# when performing inference:
# load a model:
model = Model.load("model.joblib")
# and use it to map a low-resolution spectrum to a high-resolution one
# as before, low_resolution_raw_data is a dictionary mapping the channel name
# in the format "channel_[1-4]_[A-D]" to a 2D numpy array of shape (number_of_train_IDs, features)
# all names and shapes must match those used in training, except for number_of_train_IDs, which may vary
interpolated_high_resolution_data = model.predict(low_resolution_raw_data, pulse_energy=beam_intensity)
# extra useful functions for debugging:
# this may be used to debug the finding of the prompt peak in the low-resolution data
# ideally this finds the peak correctly in the sample low-resolution data used for training
model.debug_peak_finding(low_resolution_raw_data, "test_peak_finding.png")
# this provides a smoothed version of the high-resolution spectrum, filtering out noise
# caused by fluctuations below the spectrometer's resolution
# it may be useful for plotting and debugging
high_resolution_filtered = model.preprocess_high_res(high_resolution_intensity,
                                                     high_resolution_photon_energy)
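With these outputs in hand, one may, for example, compare the raw and filtered high-resolution spectra. This is a minimal sketch, assuming matplotlib is installed (as in the notebook dependencies above) and the array shapes described earlier:
import matplotlib.pyplot as plt
# plot the raw and filtered high-resolution spectra for the first train ID
plt.plot(high_resolution_photon_energy[0], high_resolution_intensity[0], label="raw")
plt.plot(high_resolution_photon_energy[0], high_resolution_filtered[0], label="filtered")
plt.xlabel("Photon energy [eV]")
plt.ylabel("Intensity [arb. units]")
plt.legend()
plt.savefig("test_high_res_filtering.png")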
The input data shown here may be retrieved using extra_data, as in the following example:
from extra_data import RunDirectory, by_id
from itertools import product
# matching_ids returns the train IDs common to all given arrays
from pes_to_spec.model import matching_ids
run = RunDirectory(f"/gpfs/exfel/exp/SA3/202121/p002935/raw/r0015")
# get train IDs and match them, so we are sure to have information from all needed sources
# in this example, there is an offset of -2 in the SPEC train ID, so correct for it
# (this is usually not necessary)
spec_offset = -2
spec_tid = spec_offset + run['SA3_XTD10_SPECT/MDL/FEL_BEAM_SPECTROMETER_SQS1:output',
                             "data.trainId"].ndarray()
pes_tid = run['SA3_XTD10_PES/ADC/1:network',
              "digitizers.trainId"].ndarray()
xgm_tid = run['SA3_XTD10_XGM/XGM/DOOCS:output',
              "data.trainId"].ndarray()
# take the intersection of the train IDs
# this could have been done by a select call in the RunDirectory,
# but it would not correct for the spec_offset
tids = matching_ids(spec_tid, pes_tid, xgm_tid)
train_tids = tids[:-10]
test_tids = tids[-10:]
# read the spec photon energy and intensity
high_resolution_photon_energy = run['SA3_XTD10_SPECT/MDL/FEL_BEAM_SPECTROMETER_SQS1:output',
                                    "data.photonEnergy"].select_trains(by_id[tids - spec_offset]).ndarray()
high_resolution_intensity = run['SA3_XTD10_SPECT/MDL/FEL_BEAM_SPECTROMETER_SQS1:output',
                                "data.intensityDistribution"].select_trains(by_id[tids - spec_offset]).ndarray()
# read the PES data for each channel
channels = [f"channel_{i}_{l}" for i, l in product(range(1, 5), ["A", "B", "C", "D"])]
low_resolution_raw_data = {ch: run['SA3_XTD10_PES/ADC/1:network',
                                   f"digitizers.{ch}.raw.samples"].select_trains(by_id[tids]).ndarray()
                           for ch in channels}
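The pulse_energy argument passed to fit and predict can be read from the XGM in the same way. The sketch below assumes the key data.intensitySa3TD, a common choice for SASE3 data, which should be verified against the keys available in the run:
# read the per-train pulse energy from the XGM
# (the key name is an assumption; check run.keys_for_source('SA3_XTD10_XGM/XGM/DOOCS:output'))
beam_intensity = run['SA3_XTD10_XGM/XGM/DOOCS:output',
                     "data.intensitySa3TD"].select_trains(by_id[tids]).ndarray()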
Offline analysis test
After installation, the command-line tool offline_analysis may be used to test the method on recorded datasets.