[slurm-users] Graphing job metrics

Fri Jan 5 08:07:18 MST 2018

On Monday, 13 November 2017 11:18:08 CET Nicholas McCollum wrote:
> Now that there is a slurm-users mailing list, I thought I would share
> something with the community that I have been working on to see if anyone
> else is interested in it.  I have a lot of students on my cluster and I
> really wanted a way to show my users how efficient their jobs are, or let
> them know that they are wasting resources.
> 
> I created a few scripts that leverage Graphite and Whisper databases
> (RRD-like) to gather metrics from Slurm jobs running in cgroups.  The
> resolution of the metrics is set by the retention interval that you
> specify in Graphite.  In my case I can store 1-minute metrics for CPU and
> memory usage for the entire lifetime of a job.
> 
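The collection step described above could look roughly like the following. This is only a minimal sketch, not the scripts from the blog post: the cgroup v1 paths (slurm/uid_<uid>/job_<jobid>), the metric prefix slurm.jobs.<jobid>, and all function names are assumptions for illustration, and it sends data using Graphite's plaintext protocol (one "path value timestamp" line per metric) to Carbon's usual port 2003.

```python
import socket
import time
from pathlib import Path

# Assumed cgroup v1 mount point; adjust for your site (cgroup v2 differs).
CGROUP_ROOT = Path("/sys/fs/cgroup")


def job_cgroup_files(uid, job_id, root=CGROUP_ROOT):
    """Return the (assumed) cgroup v1 files with CPU and memory usage of a job."""
    rel = Path("slurm") / f"uid_{uid}" / f"job_{job_id}"
    return {
        "cpu_ns": root / "cpuacct" / rel / "cpuacct.usage",
        "mem_bytes": root / "memory" / rel / "memory.usage_in_bytes",
    }


def read_counter(path):
    """Read a single integer value from a cgroup file."""
    return int(Path(path).read_text().strip())


def graphite_lines(prefix, values, timestamp):
    """Format values as Graphite plaintext lines: '<path> <value> <timestamp>'."""
    return [
        f"{prefix}.{name} {value} {int(timestamp)}\n"
        for name, value in sorted(values.items())
    ]


def send_to_graphite(lines, host="localhost", port=2003):
    """Send plaintext metric lines to a Carbon line receiver over TCP."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


def collect_once(uid, job_id, host="localhost", port=2003):
    """Read the job's counters once and push them to Graphite."""
    files = job_cgroup_files(uid, job_id)
    values = {name: read_counter(path) for name, path in files.items()}
    prefix = f"slurm.jobs.{job_id}"  # hypothetical metric naming scheme
    send_to_graphite(graphite_lines(prefix, values, time.time()), host, port)
```

Calling collect_once(uid, job_id) once a minute on the compute node (from cron, for example) would line up with a 1-minute retention interval in Graphite.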
> From these databases, I have written scripts that can notify me if a user
> job is wasting resources, like requesting 64 cores when their application
> only scales to 8.
> 
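A check for this kind of waste can be sketched from two samples of a cumulative CPU-time counter (such as cpuacct.usage, which counts nanoseconds). The function names and the 50% threshold below are hypothetical choices for illustration, not taken from the original scripts.

```python
def cores_in_use(cpu_ns_start, cpu_ns_end, seconds):
    """Average number of busy cores between two cumulative CPU-time samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (cpu_ns_end - cpu_ns_start) / (seconds * 1e9)


def cpu_efficiency(cores_used, cores_allocated):
    """Fraction of the allocated cores that were actually busy."""
    if cores_allocated <= 0:
        raise ValueError("cores_allocated must be positive")
    return cores_used / cores_allocated


def is_wasteful(cores_used, cores_allocated, threshold=0.5):
    """Flag a job whose CPU efficiency is below the (hypothetical) threshold."""
    return cpu_efficiency(cores_used, cores_allocated) < threshold
```

For the example in the quoted text, a job that requested 64 cores but keeps only 8 busy has an efficiency of 8 / 64 = 0.125 and would be flagged.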
> I have also created a script that lets a user cURL a Grafana instance to
> create graphs of their job metrics.
> 
> If anyone is interested, I wrote something up quickly at:
> https://xathor.blogspot.com/2017/11/graphing-slurm-cgroup-job-metrics.html
> 
> If there's interest I would be more than happy to polish the code a little
> and share it on GitHub.
> 
> I am also at SC17 if anyone wants to meet up and check it out in person.

netdata (https://github.com/firehol/netdata) also provides this kind of
information, collected from cgroups in real time, with 1 hour of history.
It can be configured to use back-ends to archive the metrics.

regards
Markus Köberl
-- 
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: markus.koeberl at tugraz.at