Example offline analysis.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "59f50187-f73f-471b-b668-9126e6f48501",
   "metadata": {},
   "source": [
    "# Automated virtual spectrometer example offline analysis\n",
    "\n",
    "This is an example notebook showing how to use the `pes_to_spec` infrastructure in this package. The objective here is to calibrate the photo-electron spectrometer data automatically. This is done by using data from one \"training\" run that contains data from both the photo-electron spectrometer (PES), XGM and the grating spectrometer and then using the correlation between them to calibrate the PES data when no grating spectrometer is available.\n",
    "\n",
    "This notebook includes the main analysis to simple get a calibrated PES spectrum using this automated method, as well as further analyses on the output in a second section. The objective of the second sections are only to validate results and may probably be skipped in most cases, if one needs only the PES spectrum.\n",
    "\n",
    "We start by importing some modules. The key module here is called `pes_to_spec`. It can be cloned from its repository using the following command (for example) in Maxwell:\n",
    "\n",
    "`git clone https://git.xfel.eu/machineLearning/pes_to_spec.git`\n",
    "\n",
    "After that the notebook in the directory `pes_to_spec/notebook` can be opened and executed using the kernel `xfel (current)`. The specialized environment mentioned in the `README.md` file of the package may offer a slightly better performance and allow for expert features, but this is not necessary for the standard analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d0003a3-4165-447d-bdb5-758c323a3fbf",
   "metadata": {},
   "source": [
    "## Core automated virtual spectromater usage"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aadd23ab-b894-467c-bb28-c1d25b8e94a2",
   "metadata": {},
   "source": [
    "Here we expand on the usage of the virtual spectrometer. We assume that 2 runs of data have been collected. A first run contains both XGM, PES and grating spectrometer data.\n",
    "\n",
    "A test run may contain only the XGM and PES data. If the test run contains also the grating spectrometer, it can be used also for validation, but in a real use-case this may not be available."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "d44af0b6-9c00-4e70-b49b-d74ed562e92f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "# add the pes_to_spec main directory\n",
    "# (change this depending on where you started the notebook if needed, or comment it out if you have done pip install in pes_to_spec)\n",
    "sys.path.append('..')\n",
    "\n",
    "# you meay need to do pip install matplotlib seaborn extra_data for this notebook, additionally\n",
    "# for this notebook the following packages are needed:\n",
    "# pip install \"numpy>=1.21\" \"scipy>=1.6\" \"scikit-learn>=1.2.0\" torch torchbnn  matplotlib seaborn extra_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "da002d3e-c0da-419b-922b-0ab5c6deece8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Warning: BNN model disabled. It requires PyTorch. Check if it is installed.\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.gridspec import GridSpec \n",
    "import seaborn as sns\n",
    "\n",
    "import lmfit\n",
    "import scipy\n",
    "from extra_data import open_run, by_id\n",
    "from itertools import product\n",
    "from pes_to_spec.model import get_model_with_resolution, Model, matching_ids, matching_two_ids\n",
    "\n",
    "from typing import Any, Dict"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "494a729c-dff4-4501-b828-fba2aaae5a23",
   "metadata": {},
   "source": [
    "### Input data\n",
    "\n",
    "Note that the data in the training run must be large enough, compared to the number of model parameters.\n",
    "\n",
    "Only the grating spectrometer, PES and XGM data is used for training, while only the PES and XGM data is needed for testing.\n",
    "However, more data is collected here to validate the results.\n",
    "\n",
    "Please adjust the proposal number, run number and name of the devices below as needed. Specifically, note that the grating spectrometer name changes depending on where the data has been collected. If unsure, try `run.info()` to check if the it contains `SPECTROMETER_SCS_NAVITAR`, `SPECTROMETER_SQS_NAVITAR` or `SPECTROMETER_SXP_NAVITAR`, etc.\n",
    "\n",
    "Additionally, please check that the list of PES channels includes the channels active during the runs. It may be that data from all channels are written to disk, but not all channels are active (in which case they contain only noise). This would not hurt the procedure, but could lead to erroneous comparison between the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4a301f2a-dedb-46e4-b096-fc9c6cf5b23a",
   "metadata": {},
   "outputs": [],
   "source": [
    "run = open_run(proposal=900331, run=69)\n",
    "run_test = open_run(proposal=900331, run=70)\n",
    "\n",
    "# useful names to avoid repeating it all over the notebook, in case they ever change\n",
    "grating_name = \"SA3_XTD10_SPECT/MDL/SPECTROMETER_SCS_NAVITAR:output\"\n",
    "pes_name = \"SA3_XTD10_PES/ADC/1:network\"\n",
    "xgm_name = \"SA3_XTD10_XGM/XGM/DOOCS:output\"\n",
    "\n",
    "pres_name = \"SA3_XTD10_PES/GAUGE/G30310F\"\n",
    "volt_name = \"SA3_XTD10_PES/MDL/DAQ_MPOD\"\n",
    "\n",
    "# PES channels\n",
    "channels = [f\"channel_{i}_{l}\" for i, l in product([1, 3, 4], [\"A\", \"B\", \"C\", \"D\"])]\n",
    "\n",
    "def get_gas(run) -> str:\n",
    "    \"\"\"Get gas in chamber for logging.\"\"\"\n",
    "    gas_sources = [\n",
    "                  \"SA3_XTD10_PES/DCTRL/V30300S_NITROGEN\",\n",
    "                  \"SA3_XTD10_PES/DCTRL/V30310S_NEON\",\n",
    "                  \"SA3_XTD10_PES/DCTRL/V30320S_KRYPTON\",\n",
    "                  \"SA3_XTD10_PES/DCTRL/V30330S_XENON\",\n",
    "              ]\n",
    "    gas_active = list()\n",
    "    for gas in gas_sources:\n",
    "        # check if this gas source is interlocked\n",
    "        if gas in run.all_sources and run[gas, \"interlock.AActionState.value\"].ndarray().sum() == 0:\n",
    "            # it is not, so this gas was used\n",
    "            gas_active += [gas.split(\"/\")[-1].split(\"_\")[-1]]\n",
    "    gas = \"_\".join(gas_active)\n",
    "    return gas\n",
    "\n",
    "def get_tids(run) -> np.ndarray:\n",
    "    \"\"\"Get which train IDs contain all necessary inputs for training.\"\"\"\n",
    "    if grating_name in run.all_sources:\n",
    "        spec_tid = run[grating_name, \"data.trainId\"].ndarray()\n",
    "    else:\n",
    "        spec_tid = None\n",
    "    pes_tid = run[pes_name, \"digitizers.trainId\"].ndarray()\n",
    "    xgm_tid = run[xgm_name, \"data.trainId\"].ndarray()\n",
    "\n",
    "    # match tids to be sure we have all inputs:\n",
    "    if spec_tid is None:\n",
    "        tids = matching_two_ids(pes_tid, xgm_tid)\n",
    "    else:\n",
    "        tids = matching_ids(spec_tid, pes_tid, xgm_tid)\n",
    "\n",
    "    return tids\n",
    "\n",
    "def get_data(run, tids) -> Dict[str, Any]:\n",
    "    \"\"\"Get all relevant data.\"\"\"\n",
    "    data = dict()\n",
    "    data[\"intensity\"] = run[xgm_name, \"data.intensitySa3TD\"].select_trains(by_id[tids]).ndarray()[:, 0][:, np.newaxis]\n",
    "    data[\"pes\"] = {ch: run[pes_name,\n",
    "                           f\"digitizers.{ch}.raw.samples\"].select_trains(by_id[tids]).ndarray()\n",
    "                    for ch in channels}\n",
    "    # this may not be available in testing\n",
    "    if grating_name in run.all_sources:\n",
    "        data[\"grating\"] = run[grating_name, \"data.intensityDistribution\"].select_trains(by_id[tids]).ndarray()\n",
    "        data[\"energy\"] = run[grating_name, \"data.photonEnergy\"].select_trains(by_id[tids]).ndarray()\n",
    "    # only for validation\n",
    "    data[\"pressure\"] = run[pres_name, \"value\"].select_trains(by_id[tids]).ndarray()\n",
    "    data[\"voltage\"] = run[volt_name, \"u213.value\"].select_trains(by_id[tids]).ndarray()\n",
    "    data[\"gas\"] = get_gas(run)\n",
    "    return data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "210c0550-1abb-43a0-99a5-7c35d2766be0",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# get the matched train IDs\n",
    "tids = get_tids(run)\n",
    "test_tids = get_tids(run_test)\n",
    "\n",
    "# get the data\n",
    "data = get_data(run, tids)\n",
    "data_test = get_data(run_test, test_tids)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "017865a1-057f-48c7-8bef-a6e40490de2c",
   "metadata": {},
   "source": [
    "Now the `data` and `data_test` dictionaries contain the necessary information about the training and test runs.\n",
    "The code above also selected only entries with train IDs on which at least SPEC, PES and XGM were present.\n",
    "\n",
    "Note that for training, it is assumed that only one pulse is present. For testing there is no such requirement.\n",
    "\n",
    "First output some general information about the conditions of the measurement device.\n",
    "\n",
    "The method expects the run conditions to be the same in training and inference, so check that there is not a significant deviation here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "956105a6-d37e-453c-bfeb-2b1c876ee3f2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Gas in training: NEON\n",
      "Gas in testing: NEON\n"
     ]
    }
   ],
   "source": [
    "print(f\"Gas in training: {data['gas']}\")\n",
    "print(f\"Gas in testing: {data_test['gas']}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "4654f205-edc6-45f7-97bd-0d088c38edb0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Voltage in training: -116.01 +/- 0.01\n",
      "Voltage in testing: -116.00 +/- 0.01\n"
     ]
    }
   ],
   "source": [
    "print(f\"Voltage in training: {np.mean(data['voltage']):0.2f} +/- {np.std(data['voltage']):0.2f}\")\n",
    "print(f\"Voltage in testing: {np.mean(data_test['voltage']):0.2f} +/- {np.std(data_test['voltage']):0.2f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "fa662544-3caa-4404-bb61-fa41add82642",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Pressure in training: 1.29e-06 +/- 3.95e-08\n",
      "Pressure in testing: 1.29e-06 +/- 3.91e-08\n"
     ]
    }
   ],
   "source": [
    "print(f\"Pressure in training: {np.mean(data['pressure']):0.2e} +/- {np.std(data['pressure']):0.2e}\")\n",
    "print(f\"Pressure in testing: {np.mean(data_test['pressure']):0.2e} +/- {np.std(data_test['pressure']):0.2e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9b62e3a-aff9-4436-905d-ea02c11aecf6",
   "metadata": {},
   "source": [
    "### Using the virtual spectrometer"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5962e483-60da-4c70-bb09-dce5fc9745e0",
   "metadata": {},
   "source": [
    "Now we will actually train the model. We do that by creating a `Model` object (from `pes_to_spec`) and fitting the data. It requires the PES intensity, the grating spec. intensity, the energy axis from grating spec., as well as the energy measured in the XGM (which has better resolution than the integral of the PES).\n",
    "\n",
    "We actually do this twice: in the first time we do it without any preprocessing on the grating spectrometer data. After that step, we record the maximum resolution achievable with the virtual spectrometer and then redo the estimate using the discovered resolution to pre-process the grating spectrometer data, so that this information is taken into account in the uncertainty estimate.\n",
    "\n",
    "The work of calculating the expected resolution and adapting to it is done by the `get_model_with_resolution`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "70c4e386-61b2-49e7-b759-8c0175ee457a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Checking data quality in high-resolution data.\n",
      "Selected 7165 of 7165 samples\n",
      "Fitting PCA on low-resolution data.\n",
      "Using 556 comp. for PES PCA.\n",
      "Fitting PCA on high-resolution data.\n",
      "Using 24 comp. for grating spec. PCA.\n",
      "Fitting outlier detection\n",
      "Fitting model.\n",
      "Calculate PCA unc. on high-resolution data.\n",
      "Calculate transfer function\n",
      "Resolution: 0.8169395778859361\n",
      "Calculate PCA on channel_1_A\n",
      "Calculate PCA on channel_1_B\n",
      "Calculate PCA on channel_1_C\n",
      "Calculate PCA on channel_1_D\n",
      "Calculate PCA on channel_3_A\n",
      "Calculate PCA on channel_3_B\n",
      "Calculate PCA on channel_3_C\n",
      "Calculate PCA on channel_3_D\n",
      "Calculate PCA on channel_4_A\n",
      "Calculate PCA on channel_4_B\n",
      "Calculate PCA on channel_4_C\n",
      "Calculate PCA on channel_4_D\n",
      "End of fit.\n",
      "Resolution: 0.82 eV\n"
     ]
    }
   ],
   "source": [
    "model, resolution = get_model_with_resolution(data['pes'],\n",
    "                                              data['grating'],\n",
    "                                              data['energy'],\n",
    "                                              data['intensity'],\n",
    "                                              channels=channels,\n",
    "                                              )\n",
    "print(f\"Resolution: {resolution:.2f} eV\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f6c9d0a-29af-42c9-ba3c-08aae561232b",
   "metadata": {},
   "source": [
    "We can save the resulting model for later usage using the `save` method and then reload it later with the `load` function as shown below. This can be useful if one wants to reuse information at a later stage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "a0adb57b-7496-4781-9511-ac2a8d05658d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# save it for later usage:\n",
    "model.save(\"model.joblib\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "67616041-4239-48f0-a4f7-2caf5d896652",
   "metadata": {},
   "outputs": [],
   "source": [
    "# load a model -- you can skip the fit above if you just start from this line\n",
    "#model = Model.load(\"model.joblib\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de6cf667-2f50-47c2-983a-a6c0b2d17714",
   "metadata": {},
   "source": [
    "Now we can use this mapping on new data. In this example, we try it in the test dataset mentioned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "917156f3-9476-48e0-9121-5f75f185045f",
   "metadata": {},
   "outputs": [],
   "source": [
    "pred = model.predict(data_test['pes'], pulse_energy=data_test['intensity'])\n",
    "\n",
    "# add the references in this array in the same array format, so we can plot them later\n",
    "pred[\"energy\"] = model.get_energy_values()\n",
    "pred['grating'] = data_test['grating'][:, np.newaxis, :]\n",
    "# let us show also how the virtual spectrometer looks like smeared by the virtual spectrometer resolution\n",
    "pred['grating_smooth'] = model.preprocess_high_res(data_test['grating'], resolution=model.resolution)[:, np.newaxis, :]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77866435-1cb6-40ac-a9a5-8f2ae37017b7",
   "metadata": {},
   "source": [
    "Let's try to predict in the independent run in the test dataset. The performance of the model varies a lot if the beam intensity is very different from the training one. To ensure we take a train ID to visualize that is relatively high intensity, we sort the train IDs by XGM intensity and then choose the highest intensity one.\n",
    "One could try other train IDs.\n",
    "\n",
    "For train IDs with close to zero beam intensity, there is a relatively larger error, since the training data did not contain any of those samples and the signal-to-noise ratio is relatively high."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "ed62606a-4ea7-4e0a-8b61-73e682cacf04",
   "metadata": {},
   "outputs": [],
   "source": [
    "# choose train ID of the test dataset by XGM intensity\n",
    "test_intensity = np.argsort(data_test['intensity'][:,0])\n",
    "example_tid = test_intensity[-1]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f931e9e0-84a7-4e4f-bfad-588bfe77267c",
   "metadata": {},
   "source": [
    "Now we can actually plot it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "fd42984c-554c-4c69-bf8a-119eeb0cca62",
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot(data):\n",
    "    \"\"\"Plot prediction and expectation.\"\"\"\n",
    "    fig = plt.figure(figsize=(12, 8))\n",
    "    gs = GridSpec(1, 1)\n",
    "    ax = fig.add_subplot(gs[0, 0])\n",
    "    ax.plot(data[\"energy\"], data[\"grating\"], c='b', lw=3, label=\"Grating spec. measurement\")\n",
    "    ax.plot(data[\"energy\"], data[\"grating_smooth\"], c='g', lw=3, label=\"Smoothened grating spec. measurement\")\n",
    "    ax.plot(data[\"energy\"], data[\"expected\"], c='r', ls='--', lw=3, label=\"Prediction\")\n",
    "    ax.fill_between(data[\"energy\"], data[\"expected\"] - data[\"total_unc\"], data[\"expected\"] + data[\"total_unc\"], facecolor='gold', alpha=0.5, label=\"68% unc.\")\n",
    "    ax.legend(frameon=False, borderaxespad=0, loc='upper left')\n",
    "    ax.spines['top'].set_visible(False)\n",
    "    ax.spines['right'].set_visible(False)\n",
    "    Y = np.amax(data[\"grating\"])\n",
    "    ax.set(\n",
    "            xlabel=\"Photon energy [eV]\",\n",
    "            ylabel=\"Intensity\",\n",
    "            ylim=(0, 1.3*Y))\n",
    "    plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "xfel (current)",
   "language": "python",
   "name": "xfel-current"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}