[slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

John Snowdon John.Snowdon at newcastle.ac.uk
Fri Sep 8 08:44:56 UTC 2023


I've needed to do this as part of some analysis work we are undertaking to determine the requirements for a replacement system.

We don't currently have anything structured in place to analyse Slurm data; we have lots of Grafana system-level metrics, but nothing that looks at trends in useful metrics like:

- Size (and age) of jobs sitting in the pending state
- Average runtime of jobs
- Plotting workload sizing information such as cores/job and memory/core so that we can understand how our users are utilising the service
- Demand (and utilisation) of particular partitions

I couldn't find anything that was exactly what we wanted, so I spent a couple of afternoons last week putting something together in Python to wrap around sacct / sinfo output.
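To give a flavour of the approach: the core of it is really just shelling out to sacct with machine-parseable output and turning each record into a Python dict. A stripped-down sketch (the field list and time window here are illustrative, not the exact ones the tool uses):

    #!/usr/bin/env python3
    """Rough sketch: pull recent job records by wrapping sacct."""
    import subprocess

    SACCT_FIELDS = ["JobID", "Partition", "State", "AllocCPUS", "ReqMem", "Elapsed"]

    def get_jobs(start="now-24hours", end="now"):
        """Return one dict per job allocation reported by sacct in [start, end]."""
        cmd = [
            "sacct", "--allusers", "--allocations",   # whole allocations, not steps
            "--starttime", start, "--endtime", end,
            "--format", ",".join(SACCT_FIELDS),
            "--parsable2", "--noheader",              # '|'-delimited, no trailing '|'
        ]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        return [dict(zip(SACCT_FIELDS, line.split("|")))
                for line in out.splitlines() if line]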

So far I've got reports for what is happening 'now', as well as summaries for the following periods:

24 hours
7 days
30 days
1 year

Data is analysed based on jobs running/pending/completed/failed during time windows and summarised in terms of sample periods per day (a 24-hour report having the finest sampling resolution of 6x 10-minute windows per hour). The output of each sample period is stored as a persistent JSON object on the filesystem in case the same report is run again, or that period is included as part of a larger analysis window.
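The caching side is nothing clever: each sample window just becomes a JSON file keyed on its start/end timestamps, something along these lines (paths and naming simplified):

    import json
    from pathlib import Path

    CACHE_DIR = Path("slurm-report-cache")   # wherever the persistent objects live

    def cached_period(start_ts, end_ts, summarise):
        """Return the summary dict for one sample window, reusing a JSON file
        on disk if that window has already been analysed."""
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        cache_file = CACHE_DIR / f"{start_ts}_{end_ts}.json"
        if cache_file.exists():
            return json.loads(cache_file.read_text())
        summary = summarise(start_ts, end_ts)   # e.g. built on get_jobs() above
        cache_file.write_text(json.dumps(summary))
        return summary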

I output to flat HTML files using the Jinja2 templating module and visualise data using the ubiquitous Highcharts and DataTables JavaScript libraries.
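The rendering step is equally plain: hand a summary dict to a Jinja2 template and write the result out as a static file, with Highcharts/DataTables pulled in by script tags inside the template. Roughly (template name is illustrative):

    from jinja2 import Environment, FileSystemLoader

    def render_report(summary, template_dir="templates", out_file="report.html"):
        """Render one summary dict into a flat HTML report."""
        env = Environment(loader=FileSystemLoader(template_dir), autoescape=True)
        template = env.get_template("report.html.j2")   # assumed template name
        with open(out_file, "w") as fh:
            fh.write(template.render(summary=summary))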

In our case we're more interested in things like:

- Min/Max/Median cores/job, plus the lowest average value which would satisfy X% of all jobs
- Min/Max/Median memory/core, plus the lowest average value which would satisfy X% of all jobs
- Min/Max/Median nodes/job, plus the lowest average value which would satisfy X% of all jobs
- Backlog of jobs waiting in the pending state
- Percentage of jobs that 'fail' (end up in some state other than completed)
- Scatter chart of cores/job against memory/core (i.e. what is the bulk of our user workload; parallel/serial, low memory/high memory?)

i.e. data points which will be useful in our sizing decisions for a replacement platform, both in terms of hardware and of partition definitions.
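In practice those "lowest value satisfying X% of all jobs" figures boil down to a percentile over the per-job metric, alongside the usual min/max/median. Something like this (simplified sketch, nearest-rank percentile):

    import math
    import statistics

    def sizing_stats(values, pct=95):
        """Min/max/median of a per-job metric (e.g. cores/job), plus the smallest
        value that would have been enough for pct% of jobs (nearest-rank percentile)."""
        ordered = sorted(values)
        if not ordered:
            return {}
        idx = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
        return {
            "min": ordered[0],
            "max": ordered[-1],
            "median": statistics.median(ordered),
            f"p{pct}": ordered[idx],
        }

    # e.g. sizing_stats([int(j["AllocCPUS"]) for j in get_jobs()])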

When it's at a point where it is usable, I'm sure we can share the code. It's pretty much self-contained; the only dependencies are Slurm and Python 3 - no web components needed (unless you want to serve the generated reports to users, of course).

John Snowdon
Advanced Computing Consultant

Newcastle University IT Service
The Elizabeth Barraclough Building
91 Sandyford Road
Newcastle upon Tyne, 
NE1 8HW


