[slurm-users] Graphing job metrics
Markus Köberl
markus.koeberl at tugraz.at
Fri Jan 5 08:07:18 MST 2018
On Monday, 13 November 2017 11:18:08 CET Nicholas McCollum wrote:
> Now that there is a slurm-users mailing list, I thought I would share
> something with the community that I have been working on to see if anyone
> else is interested in it. I have a lot of students on my cluster and I
> really wanted a way to show my users how efficient their jobs are, or let
> them know that they are wasting resources.
>
> I created a few scripts that use Graphite and its Whisper databases
> (RRD-like) to gather metrics from Slurm jobs running in cgroups. The
> resolution of the metrics is set by the retention interval that you specify
> in Graphite. In my case I can store 1-minute metrics for CPU usage and memory
> usage for the entire lifetime of a job.
>
> From these databases, I have written scripts that can notify me if a user
> job is wasting resources, like requesting 64 cores when their application
> only scales to 8.
>
> I have also created a script that lets a user cURL a Grafana instance to
> create graphs of their job metrics.
>
> If anyone is interested, I wrote up a quick description at:
> https://xathor.blogspot.com/2017/11/graphing-slurm-cgroup-job-metrics.html
>
> If there's interest, I would be more than happy to polish the code a little
> and share it on GitHub.
>
> I am also at SC17 if anyone wants to meet up and check it out in person.
netdata (https://github.com/firehol/netdata) also collects this kind of
information from cgroups in real time and keeps 1 hour of history. It can be
configured to send the metrics to a back-end for archiving.
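
For anyone who wants to try the Graphite route from the quoted message by
hand, here is a rough sketch of the idea, not Nicholas's actual scripts. It
assumes cgroup v1 with the cpuacct controller, a job cgroup path of the form
/sys/fs/cgroup/cpuacct/slurm/uid_<uid>/job_<jobid>/cpuacct.usage, and a carbon
daemon accepting the plaintext protocol on localhost:2003. The metric name
"slurm.job.<jobid>.cpu_cores" and all function names are made up for
illustration.

```python
import socket
import time


def read_counter(path):
    """Read a single integer counter, e.g. cpuacct.usage (nanoseconds of CPU time)."""
    with open(path) as f:
        return int(f.read().strip())


def cpu_cores_used(usage_ns_start, usage_ns_end, interval_s):
    """Average number of busy cores between two cpuacct.usage readings."""
    return (usage_ns_end - usage_ns_start) / (interval_s * 1e9)


def graphite_line(path, value, timestamp):
    """Format one sample in Graphite's plaintext protocol: 'path value timestamp'."""
    return f"{path} {value} {int(timestamp)}\n"


def looks_wasteful(cores_used, cores_requested, threshold=0.5):
    """Flag a job that uses less than `threshold` of its requested cores.

    The 0.5 threshold is an arbitrary example value.
    """
    return cores_used < threshold * cores_requested


def send_to_graphite(lines, host="localhost", port=2003):
    """Send already formatted plaintext lines to carbon (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


def sample_job(usage_path, job_id, interval_s=60):
    """Take two readings `interval_s` apart and return one Graphite line."""
    start = read_counter(usage_path)
    time.sleep(interval_s)
    end = read_counter(usage_path)
    cores = cpu_cores_used(start, end, interval_s)
    return graphite_line(f"slurm.job.{job_id}.cpu_cores", cores, time.time())
```

Calling sample_job() once a minute per running job and passing the result to
send_to_graphite() would give the 1-minute resolution mentioned above, and
looks_wasteful(8.0, 64) would flag the "64 cores requested, 8 used" case.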
regards
Markus Köberl
--
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: markus.koeberl at tugraz.at