
Add instance start/stop for netdata to calibrate script

Robert Rosca requested to merge feat/netdata-telemetry into master

Change adds two lines to slurm_calibrate.sh:

singularity instance start --writable-tmpfs --hostname $SLURM_JOB_ID ~/netdata-image.sif netdata-$HOSTNAME-$SLURM_JOB_ID || true
...
singularity instance stop netdata-$HOSTNAME-$SLURM_JOB_ID || true
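As a minimal sketch of how these two commands could bracket the job, the stop can be registered as an EXIT trap so that it also runs when the job fails part-way. This is an illustration, not the actual content of slurm_calibrate.sh: the helper netdata_instance_name, the "dryrun" fallback and the singularity guard are assumptions added so the snippet can be tried outside a Slurm allocation.

```shell
#!/bin/bash
# Sketch only, not the real slurm_calibrate.sh.

# Hypothetical helper: builds the instance name used in the two lines above.
netdata_instance_name() {
    local host="$1" job_id="$2"
    printf 'netdata-%s-%s\n' "$host" "$job_id"
}

# SLURM_JOB_ID is set by Slurm inside a job; "dryrun" is a placeholder otherwise.
job_id="${SLURM_JOB_ID:-dryrun}"
instance="$(netdata_instance_name "${HOSTNAME:-$(hostname)}" "$job_id")"

if command -v singularity >/dev/null 2>&1; then
    # Same commands as in the change; "|| true" keeps telemetry failures
    # from aborting the calibration job.
    singularity instance start --writable-tmpfs --hostname "$job_id" \
        ~/netdata-image.sif "$instance" || true
    # Stop the instance whenever the script exits, including on errors.
    trap 'singularity instance stop "$instance" || true' EXIT
fi

# ... the calibration work would run here ...
```

The trap is a design choice rather than a requirement: the merge request itself simply places the stop command after the work.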

I have placed a Singularity image called netdata-image.sif in the home directories of xcal and xcaltst. These images are built from the netdata Docker image, with only some minor changes:

  • The two variable netdata directories (/var/cache/netdata and /var/lib/netdata) have their permissions set to 777 so that they can be written to on the tmpfs layer
  • netdata.conf is copied into the image during the build phase and sets the registry to the one hosted on exfldadev01.desy.de
  • stream.conf is copied into the image during the build phase and configures the netdata instance as a child node that streams its data to exfldadev01.desy.de
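For orientation, the two configuration files described above might look roughly like the following. This is a sketch, not the files shipped in the image: the temporary directory stands in for the image's netdata configuration directory, the API key is a placeholder, and using port 19999 as the streaming destination is an assumption based on the dashboard URL.

```shell
#!/bin/bash
# Sketch of the two config files; the real ones in the image may differ.
conf_dir="$(mktemp -d)"   # stand-in for the netdata config directory in the image

cat > "$conf_dir/netdata.conf" <<'EOF'
[registry]
    # Announce the registry hosted on exfldadev01 instead of the public one.
    registry to announce = http://exfldadev01.desy.de:19999
EOF

cat > "$conf_dir/stream.conf" <<'EOF'
[stream]
    # Child node: push metrics to the parent instead of only keeping them locally.
    enabled = yes
    # Assumed destination; the port is taken from the dashboard URL.
    destination = exfldadev01.desy.de:19999
    # Placeholder: must match a key accepted by the parent's stream.conf.
    api key = 00000000-0000-0000-0000-000000000000
EOF

echo "Wrote example configs to $conf_dir"
```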

Now, whenever a job starts, a node is added to netdata with the job ID as the node name.

To check the job:

  1. Go to http://exfldadev01.desy.de:19999
  2. Click the arrow (>) in the top left. This shows a list of the jobs, both currently running and recently completed
  3. Click one of them and you'll be taken to the page for the job

Alternatively, you can go to http://exfldadev01.desy.de:19999/host/{SLURM_JOB_ID} and this will (if the job has been running for at least a few seconds so that netdata had a chance to start up) show you the dashboard without having to click around.
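The direct URL above is easy to build from a job ID. The snippet below is a small sketch: netdata_job_url is a hypothetical helper, and the squeue loop assumes that the ID printed by squeue's %A format matches the SLURM_JOB_ID seen inside the job.

```shell
#!/bin/bash
# Hypothetical helper: dashboard URL for a given Slurm job ID.
netdata_job_url() {
    printf 'http://exfldadev01.desy.de:19999/host/%s\n' "$1"
}

# If squeue is available, print one dashboard URL per running job of this user.
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$USER" -t RUNNING -o '%A' | while read -r job_id; do
        netdata_job_url "$job_id"
    done
fi
```

For a single job, `netdata_job_url 1234567` prints the address to paste into a browser.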

Testing

I tested this with normal submissions, and with ones where I tried to force the nodes to run out of memory:

#!/bin/bash

O=/gpfs/exfel/data/scratch/roscar/tmp/$(date +%F-%H-%M-%S)

xfel-calibrate agipd CORRECT \
    --report-to $O \
    --in-folder /gpfs/exfel/exp/MID/202201/p002834/raw \
    --out-folder $O \
    --run 121 \
    --ctrl-source-template '{}/MDL/FPGA_COMP' \
    --karabo-id MID_DET_AGIPD1M-1 \
    --karabo-da AGIPD00 AGIPD01 AGIPD02 AGIPD03 AGIPD04 AGIPD05 AGIPD06 AGIPD07 AGIPD08 AGIPD09 AGIPD10 AGIPD11 AGIPD12 AGIPD13 AGIPD14 AGIPD15 \
    --karabo-id-control MID_EXP_AGIPD1M1 \
    --receiver-template '{}CH0' \
    --adjust-mg-baseline \
    --bias-voltage 300 \
    --blc-set-min \
    --blc-stripes \
    --cm-dark-fraction 0.15 \
    --cm-dark-range -30 30 \
    --cm-n-itr 4 \
    --common-mode \
    --ff-gain 1.0 \
    --force-hg-if-below \
    --force-mg-if-below \
    --hg-hard-threshold 1000 \
    --low-medium-gap \
    --mg-hard-threshold 1000 \
    --overwrite \
    --rel-gain \
    --xray-gain \
    --sequences-per-node 8 \
    --max-nodes 1

NB: note the --sequences-per-node 8 and --max-nodes 1 options, which are used here to put memory pressure on a single node.

Here are a few screenshots from an allocation made for the above submission:

image

image

You can see that there are some gaps in the graph. This is because netdata is designed to have as little impact as possible on the system, so under heavy load it starts to drop data and scales back its sampling rate.

Additionally, you can see that the allocation raised a warning followed by a critical alert:

image

The alert system is pretty flexible, and it's easy to define rules for things like RAM usage or time spent waiting for disk I/O, as well as rules telling it to send an email for certain alerts.
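To give an idea of what such a rule looks like, here is a sketch of a RAM usage alarm in netdata's health configuration format, modelled on the stock ram_in_use alarm. The file name, the alarm name and the 80/90 % thresholds are illustrative assumptions and not what is configured on exfldadev01; the temporary directory stands in for netdata's health.d directory.

```shell
#!/bin/bash
# Sketch: write an example health rule to a scratch directory.
health_dir="$(mktemp -d)"   # stand-in for netdata's health.d directory

# Quoted heredoc so that $used, $this, etc. are left for netdata to expand.
cat > "$health_dir/job_ram.conf" <<'EOF'
 alarm: job_ram_in_use
    on: system.ram
  calc: $used * 100 / ($used + $cached + $free + $buffers)
 units: %
 every: 10s
  warn: $this > 80
  crit: $this > 90
  info: RAM utilisation of the job's node (illustrative thresholds)
    to: sysadmin
EOF

echo "Wrote example rule to $health_dir/job_ram.conf"
```

The "to:" line names the recipient role; which email address that role maps to is set in netdata's notification configuration.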

There is also other interesting information available, like InfiniBand traffic:

image

Disk statistics:

image

and so on, all of which might help with troubleshooting other issues.

The information is held for at most two days. However, there are also storage and RAM limits, so if too much information is being recorded it may be dropped sooner than that. If something interesting is recorded that is worth looking at in the future, the "Export a snapshot" button on the top toolbar saves the data to a file that can be downloaded and loaded again later.

@calibration thoughts?

Edited by Robert Rosca
