[slurm-users] Validating SLURM sreport cluster utilization report

David Simpson SimpsonD4 at cardiff.ac.uk
Fri Jan 22 16:33:42 UTC 2021


Hi,

We've been using the sreport cluster utilization report to measure Down time and, from that, produce an uptime figure for the entire cluster. We hope that figure will be at or very close to 99% or above for every month of the year.

Most of the time the figure that comes back matches our impression of the day-to-day running of the cluster.

We don't log node UP/DOWN events ourselves (beyond what Slurm records), so we rely on sreport as explained above.

The December figure we have is lower than 99%. There are 438 Slurm nodes in the cluster, and in December we only remember having problems with 3 of them, so at the moment we can't account for the reported Down time.
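As a rough back-of-the-envelope check on why the figure looks wrong, here is a minimal Python sketch. It assumes all 438 nodes are the same size (sreport accounts in CPU time rather than node counts, so a mix of node sizes would shift the numbers) and that December is a full 31-day reporting window. The function names are made up for illustration and are not part of Slurm.

```python
# Sanity check: how much Down time would it take to fall below 99%?
# Assumption: identical nodes, so node-hours stand in for CPU time.

HOURS_IN_DECEMBER = 31 * 24  # 744 hours in the reporting window


def uptime_percent(total_nodes, down_node_hours, period_hours):
    """Uptime as a percentage of all node-hours in the period."""
    total_node_hours = total_nodes * period_hours
    return 100.0 * (1.0 - down_node_hours / total_node_hours)


def down_budget_node_hours(total_nodes, period_hours, target_percent):
    """Node-hours of Down time allowed while still meeting the target."""
    return total_nodes * period_hours * (1.0 - target_percent / 100.0)


# Worst case for what we remember: 3 nodes down for the whole month.
worst_case = uptime_percent(438, 3 * HOURS_IN_DECEMBER, HOURS_IN_DECEMBER)

# Down time needed before the monthly figure drops below 99%.
budget = down_budget_node_hours(438, HOURS_IN_DECEMBER, 99.0)

print(f"Uptime with 3 nodes down all month: {worst_case:.3f}%")
print(f"Down budget at 99%: {budget:.2f} node-hours")
```

Under those assumptions, even 3 nodes being down for the entire month would still leave the cluster above 99%, so the reported Down time must include more than the 3 nodes we remember.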

Is anyone else relying on sreport for this metric? If so, have you encountered this sort of situation?

regards
David


-------------
David Simpson - Senior Systems Engineer
ARCCA, Redwood Building,
King Edward VII Avenue,
Cardiff, CF10 3NB

simpsond4 at cardiff.ac.uk
+44 29208 74657

COVID-19: Cardiff University is currently under remote work restrictions. Our staff are continuing normal work schedules, but responses may be slower than usual. We appreciate your patience during this unprecedented time.
