[slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

Jason Simms jsimms1 at swarthmore.edu
Fri Sep 8 17:51:31 UTC 2023


Hello John,

I am also keen to follow your progress, as this is something we would find
extremely useful as well.

Regards,
Jason

On Fri, Sep 8, 2023 at 4:47 AM John Snowdon <John.Snowdon at newcastle.ac.uk>
wrote:

> I've been needing to do this as part of some analysis work we are
> undertaking to determine requirements for a replacement system.
>
> We don't have anything structured in place currently to analyse Slurm
> data; lots of Grafana system-level metrics but nothing to look at trends of
> useful metrics like:
>
> - Size (and age) of jobs sitting in the pending state
> - Average runtime of jobs
> - Plotting workload sizing information such as cores/job and memory/core
> so that we can understand how our users are utilising the service
> - Demand (and utilisation) of particular partitions
>
> I couldn't find anything that was exactly what we wanted, so I spent a
> couple of afternoons last week putting something together in Python to wrap
> around sacct / sinfo output.
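>
> Roughly speaking, the sacct side boils down to shelling out with parseable
> output; something like the sketch below (the exact flags and field list
> here are only illustrative):
>
>     import subprocess
>
>     def fetch_jobs(start, end):
>         """Jobs with activity between two timestamps, as a list of dicts."""
>         fields = "JobID,State,AllocCPUS,ReqMem,NNodes,Elapsed,Partition"
>         out = subprocess.run(
>             ["sacct", "--allusers", "--parsable2", "--noheader",
>              "--starttime", start, "--endtime", end, "--format", fields],
>             capture_output=True, text=True, check=True,
>         ).stdout
>         jobs = []
>         for line in out.splitlines():
>             cols = line.split("|")
>             if "." not in cols[0]:  # skip job steps like 12345.batch
>                 jobs.append(dict(zip(fields.split(","), cols)))
>         return jobs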
>
> So far I've got reports for what is happening 'now', as well as summaries
> for the following periods:
>
> - 24 hours
> - 7 days
> - 30 days
> - 1 year
>
> Data is analysed based on jobs running/pending/completed/failed during
> windows in time and summarised in terms of sample periods per day (a
> 24-hour report having the finest sampling resolution of 6x 10-minute
> windows per hour), and the output of each sample period is stored as a
> persistent JSON object on the filesystem in case the same report is run
> again, or that period is included as part of a larger analysis window.
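>
> The per-period caching is nothing fancy; a minimal sketch (paths and the
> cache key here are just an illustration):
>
>     import json
>     from pathlib import Path
>
>     CACHE_DIR = Path("cache")
>
>     def period_summary(start, end, compute):
>         """Load the cached summary for a sample period, or compute and store it."""
>         CACHE_DIR.mkdir(exist_ok=True)
>         cache_file = CACHE_DIR / f"{start}_{end}.json"
>         if cache_file.exists():
>             return json.loads(cache_file.read_text())
>         summary = compute(start, end)  # e.g. job counts/medians for the window
>         cache_file.write_text(json.dumps(summary))
>         return summary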
>
> I output to flat HTML files using the Jinja2 templating module and
> visualise data using the ubiquitous Highcharts and DataTables javascript
> libraries.
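>
> The rendering step is only a few lines of Jinja2 (template name and
> context below are just an example); Highcharts/DataTables then do the
> plotting client-side:
>
>     from jinja2 import Environment, FileSystemLoader
>
>     def write_report(summaries, outfile="report.html"):
>         env = Environment(loader=FileSystemLoader("templates"))
>         template = env.get_template("report.html.j2")
>         with open(outfile, "w") as fh:
>             fh.write(template.render(summaries=summaries))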
>
> In our case we're more interested in things like:
>
> - Min/Max/Median cores/job, plus lowest average value which would satisfy
> X% of all jobs (see the percentile sketch after this list)
> - Min/Max/Median memory/core, plus lowest average value which would
> satisfy X% of all jobs
> - Min/Max/Median nodes/job, plus lowest average value which would satisfy
> X% of all jobs
> - Backlog of jobs waiting in pending state
> - Percentage of jobs that 'fail' (end up in some state other than completed)
> - Scatter chart of cores/job to memory/core (i.e. what is the bulk of our
> user workload; parallel/serial, low memory/high memory?)
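>
> For illustration, the "satisfy X% of all jobs" figures reduce to a
> percentile over the per-job values; something along these lines (names
> here are just an example):
>
>     import math
>
>     def value_satisfying(values, pct):
>         """Smallest value v such that pct% of jobs need <= v."""
>         ordered = sorted(values)
>         idx = math.ceil(pct / 100 * len(ordered)) - 1
>         return ordered[max(idx, 0)]
>
>     # e.g. value_satisfying(cores_per_job, 95) -> core count covering 95% of jobs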
>
> i.e. data points which will be useful in our sizing decisions of a
> replacement platform, both in terms of hardware, as well as partition
> definitions.
>
> When it's at a point where it is usable, I'm sure that we can share the
> code. It's pretty much self-contained; the only dependencies are Slurm and
> Python 3 - no web components are needed (unless you want to serve the
> generated reports to users, of course).
>
> John Snowdon
> Advanced Computing Consultant
>
> Newcastle University IT Service
> The Elizabeth Barraclough Building
> 91 Sandyford Road
> Newcastle upon Tyne,
> NE1 8HW
>
>

-- 
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms