[slurm-users] Graphing job metrics

Mon Nov 13 10:18:08 MST 2017

Now that there is a slurm-users mailing list, I thought I would share
something I have been working on and see whether anyone else is interested
in it.  I have a lot of students on my cluster, and I really wanted a way to
show my users how efficient their jobs are, or to let them know when they
are wasting resources.

I created a few scripts that use Graphite and its Whisper databases (RRD-like)
to gather metrics from Slurm jobs running in cgroups.  The resolution of the
metrics is determined by the retention interval that you specify in Graphite.
In my case I can store one-minute metrics for CPU usage and memory usage for
the entire lifetime of a job.
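
For readers who want a feel for the collection step, here is a minimal
sketch.  It is not the script described above.  It assumes cgroup v1 with
Slurm's usual uid_<uid>/job_<jobid> directory layout, and the metric prefix
"slurm.jobs", the host, and the function names are all hypothetical.  The
only fixed parts are Graphite's plaintext format ("path value timestamp"
per line) and its default plaintext port, 2003.

```python
import socket
import time
from pathlib import Path


def read_counter(path):
    """Return the integer stored in a single-value cgroup file."""
    return int(Path(path).read_text().strip())


def graphite_line(metric, value, timestamp):
    """Format one sample in Graphite's plaintext protocol."""
    return f"{metric} {value} {int(timestamp)}\n"


def job_metrics(cgroup_root, uid, job_id, prefix="slurm.jobs", now=None):
    """Read CPU and memory counters for one job and return Graphite lines.

    Assumes the cgroup v1 layout <root>/<controller>/slurm/uid_X/job_Y/.
    """
    now = time.time() if now is None else now
    job_dir = Path("slurm") / f"uid_{uid}" / f"job_{job_id}"
    root = Path(cgroup_root)
    # Cumulative CPU time consumed by the job, in nanoseconds.
    cpu_ns = read_counter(root / "cpuacct" / job_dir / "cpuacct.usage")
    # Current memory usage of the job, in bytes.
    mem_bytes = read_counter(root / "memory" / job_dir / "memory.usage_in_bytes")
    return [
        graphite_line(f"{prefix}.{job_id}.cpu_ns", cpu_ns, now),
        graphite_line(f"{prefix}.{job_id}.mem_bytes", mem_bytes, now),
    ]


def send_to_graphite(lines, host="localhost", port=2003):
    """Send already formatted lines to a Carbon plaintext listener."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))
```

Run once a minute on each compute node (from cron, for example), something
like send_to_graphite(job_metrics("/sys/fs/cgroup", uid, job_id)) would give
one-minute resolution, provided the retention configured in Graphite for
these metrics is at least that fine.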

From these databases, I have written scripts that can notify me if a user's
job is wasting resources, for example by requesting 64 cores when the
application only scales to 8.
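
One way such a check could work, sketched under assumptions: take samples
of the job's cumulative CPU time counter, compute the average number of
cores kept busy, and compare that with the allocation.  The function names
and the 50% threshold are hypothetical, not taken from the actual scripts.

```python
def average_cores_used(samples):
    """Average number of busy cores over the sampled interval.

    samples: list of (unix_seconds, cumulative_cpu_nanoseconds) tuples,
    ordered by time.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, cpu0), (t1, cpu1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must span a positive time interval")
    # CPU-seconds consumed divided by wall-clock seconds elapsed.
    return (cpu1 - cpu0) / 1e9 / elapsed


def is_wasteful(samples, allocated_cores, min_efficiency=0.5):
    """True if the job used less than min_efficiency of its allocated cores."""
    return average_cores_used(samples) / allocated_cores < min_efficiency
```

With the example above, a job that accumulates 480 CPU-seconds in 60
seconds of wall time is keeping 8 cores busy; against a 64-core allocation
that is 12.5% efficiency, so is_wasteful() would flag it.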

I have also created a script that lets a user query a Grafana instance with
cURL to graph their job metrics.
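
As a rough illustration of what such a request might look like, the sketch
below builds a URL for Grafana's server-side panel rendering and downloads
the PNG.  The path /render/d-solo/<uid>/<slug> is the form used by newer
Grafana releases; older ones used /render/dashboard-solo/db/<slug>, so
check your version.  The dashboard uid, slug, panel id, and the "jobid"
template variable are hypothetical and must match a dashboard you have
defined yourself.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def render_url(base_url, dashboard_uid, slug, panel_id, job_id,
               start_s, end_s, width=1000, height=500):
    """Build a Grafana render URL for one panel over a job's time range."""
    query = urlencode({
        "panelId": panel_id,
        # Grafana expects absolute times as epoch milliseconds.
        "from": int(start_s) * 1000,
        "to": int(end_s) * 1000,
        # Template variables are passed as var-<name>=<value>.
        "var-jobid": job_id,
        "width": width,
        "height": height,
    })
    return f"{base_url.rstrip('/')}/render/d-solo/{dashboard_uid}/{slug}?{query}"


def fetch_png(url, api_token, out_path):
    """Download the rendered panel using a Grafana API token."""
    request = Request(url, headers={"Authorization": f"Bearer {api_token}"})
    with urlopen(request, timeout=30) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

The same URL works with cURL directly: pass the token in an
"Authorization: Bearer ..." header and write the response to a .png file.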

If anyone is interested, I put together a quick write-up at:
https://xathor.blogspot.com/2017/11/graphing-slurm-cgroup-job-metrics.html

If there's interest, I would be more than happy to polish the code a little
and share it on GitHub.

I am also at SC17 if anyone wants to meet up and check it out in person.

Thanks!

---

Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority