[slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

Mon Jul 24 20:55:33 UTC 2023

We are feeding job usage information into a Prometheus database for our users (and us) to look at (via Grafana).
It is also possible to get a list of jobs that are underusing memory, GPU, or whatever other metric you feed into the database.
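
To illustrate the idea of listing underusing jobs, here is a minimal sketch in Python. It is not our actual setup: the JobUsage record, its field names, the 25% threshold, and the sample values are all made up for the example, and in practice the numbers would come from queries against the metrics database rather than a hard-coded list.

```python
from dataclasses import dataclass


@dataclass
class JobUsage:
    # Hypothetical record; the field names are assumptions for this sketch.
    job_id: str
    user: str
    requested_mem_gb: float  # memory requested at submission
    peak_mem_gb: float       # highest usage seen in the collected samples


def memory_efficiency(job: JobUsage) -> float:
    """Fraction of the requested memory that the job actually used."""
    if job.requested_mem_gb <= 0:
        return 0.0
    return job.peak_mem_gb / job.requested_mem_gb


def underusing_jobs(jobs, threshold=0.25):
    """Return jobs whose memory efficiency is below threshold, worst first."""
    flagged = [j for j in jobs if memory_efficiency(j) < threshold]
    return sorted(flagged, key=memory_efficiency)


# Made-up sample data standing in for values read from the metrics database.
jobs = [
    JobUsage("1001", "alice", 64.0, 60.0),
    JobUsage("1002", "bob", 128.0, 4.0),
    JobUsage("1003", "carol", 32.0, 6.0),
]

for job in underusing_jobs(jobs):
    print(f"{job.job_id} {job.user} {memory_efficiency(job):.1%}")
```

The same pattern works for GPU or any other metric: swap the two memory fields for a requested and a used value of that resource and pick a threshold that suits your site.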

It’s a live feed with ~30s resolution from both the compute jobs and the Lustre file system.
It’s easy to extend with more metrics.

If you want to know more about what we are doing, just send me an email.

/Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Will Furnell - STFC UKRI
Sent: Monday, 24 July 2023 16:38
To: slurm-users at schedmd.com
Subject: [slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

Hello,

I am aware of ‘seff’, which lets you check the efficiency of a single job. That is good for users, but as a cluster administrator I would like to track the efficiency of all jobs from all users on the cluster, so that I can ‘re-educate’ users whose jobs have terrible resource usage efficiency.

What do other cluster administrators use for this task? Is there anything you use and recommend (or don’t recommend), or have heard of, that is able to do this? Even if it’s something like a Grafana dashboard that hooks up to the SLURM database.

Thank you,

Will.