[slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

Tue Jul 25 02:44:52 UTC 2023

I run a cluster we bought from ACT and recently updated to ClusterVisor v1.0.

The new version has (among many things) a really nice view of individual
jobs' resource utilization (GPUs, memory, CPU, temperature, etc.). I did not
pay attention to the overall statistics, so I am not sure how CV fares
there, because I care only about individual jobs (I work with individual
users and don't care about overall utilization, which is info for upper
management). At the moment only admins can see the info, but my
understanding is that they are considering making it a user-space feature,
which would be really slick.

Several years ago I used XDMOD and Supremm. It was more confusing to use
and had trouble collecting all the data we needed (which the team blamed
on some BIOS settings), so the view was incomplete. Also, the tool seemed
to be more focused on the overall stats than on per-job info (both were
available, but the focus seemed to be on the former). I am sure these tools have
improved since then, so I'm not dismissing them, just giving my opinion
based on old facts. Comparing that old version of XDMOD to current CV
(unfair, I know, but that's the comparison I've got) the latter wins hands
down for per-job information. Also probably unfair is that XDMOD and
Supremm are free and open source whereas CV is proprietary.
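
For a rough per-job number without either tool, a seff-style CPU efficiency
(CPU time used divided by elapsed time times allocated cores) can be computed
for every job from accounting data. A minimal sketch in Python, assuming
pipe-delimited input with the columns JobID|User|Elapsed|TotalCPU|AllocCPUS;
the function names are made up for illustration and this is not how CV or
XDMOD compute their figures:

```python
# Sketch: per-job CPU efficiency from pipe-delimited accounting lines.
# Assumed column order: JobID|User|Elapsed|TotalCPU|AllocCPUS


def slurm_time_to_seconds(text):
    """Convert durations like '1-02:03:04', '02:03:04' or '03:04.500' to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [float(p) for p in text.split(":")]
    # Pad missing leading fields (e.g. MM:SS has no hours field).
    while len(parts) < 3:
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(lines):
    """Return a list of (job_id, user, efficiency_percent) tuples."""
    results = []
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) != 5 or not all(fields[2:]):
            continue  # skip blank or incomplete records
        job_id, user, elapsed, total_cpu, alloc_cpus = fields
        available = slurm_time_to_seconds(elapsed) * int(alloc_cpus)
        if available <= 0:
            continue  # job never ran, nothing to measure
        used = slurm_time_to_seconds(total_cpu)
        results.append((job_id, user, 100.0 * used / available))
    return results


if __name__ == "__main__":
    import sys

    # Print the least efficient jobs first.
    for job_id, user, pct in sorted(cpu_efficiency(sys.stdin), key=lambda r: r[2]):
        print(f"{job_id}\t{user}\t{pct:.1f}%")
```

Input of this shape should be obtainable with something like
"sacct -a -X -n -P -S <start-date> --format=JobID,User,Elapsed,TotalCPU,AllocCPUS"
piped into the script, but check on your Slurm version that TotalCPU is
populated for allocation records before trusting the numbers. This covers CPU
only; memory or GPU efficiency would need other fields or another data source.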

On Mon, Jul 24, 2023 at 2:57 PM Magnus Jonsson <magnus.jonsson at umu.se>
wrote:

> We are feeding job usage information into a Prometheus database for our
> users (and us) to look at (via Grafana).
>
> It is also possible to get a list of jobs that are underusing memory, GPU
> or whatever metric you feed into the database.
>
>
>
> It’s a live feed with ~30s resolution from both compute jobs and Lustre
> file system.
>
> It’s easy to extend with more metrics.
>
>
>
> If you want more information on what we are doing just send me an email
> and I can give you more information.
>
>
>
> /Magnus
>
>
>
> --
>
> Magnus Jonsson, Developer, HPC2N, Umeå Universitet
>
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On behalf of* Will
> Furnell - STFC UKRI
> *Sent:* Monday, 24 July 2023 16:38
> *To:* slurm-users at schedmd.com
> *Subject:* [slurm-users] Tracking efficiency of all jobs on the cluster
> (dashboard etc.)
>
>
>
> Hello,
>
>
>
> I am aware of ‘seff’, which allows you to check the efficiency of a single
> job, which is good for users, but as a cluster administrator I would like
> to be able to track the efficiency of all jobs from all users on the
> cluster, so I am able to ‘re-educate’ users that may be running jobs that
> have terrible resource usage efficiency.
>
>
>
> What do other cluster administrators use for this task? Is there anything
> you use and recommend (or don’t recommend) or have heard of that is able to
> do this? Even if it’s something like a Grafana dashboard that hooks up to
> the SLURM database.
>
>
>
> Thank you,
>
>
>
> Will.
>