[slurm-users] Graphing job metrics

Rémi Palancher remi at rezib.org
Tue Nov 14 05:33:32 MST 2017

Hi there,

On 13/11/2017 at 18:18, Nicholas McCollum wrote:
> Now that there is a slurm-users mailing list, I thought I would share
> something with the community that I have been working on to see if anyone else
> is interested in it.  I have a lot of students on my cluster and I really
> wanted a way to show my users how efficient their jobs are, or let them know
> that they are wasting resources.
> I created a few scripts that leverage Graphite and whisper databases (RRD like)
> to gather metrics from Slurm jobs running in cgroups.  The resolution for the
> metrics is defined by the retention interval that you specify in graphite.  In
> my case I can store 1 minute metrics for CPU usage and Memory usage for the
> entire lifetime of a job.

FWIW, some time ago at EDF we wrote a collectd[1] plugin that does
basically the same thing, i.e. it walks the cgroups to collect CPU and
memory metrics from the jobs' processes. The code is here:


With that, you get all of collectd's flexibility in terms of metric
processing and backends (Graphite, RRD, InfluxDB, and so on).
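
To illustrate the general idea of walking the cgroups, here is a minimal
Python sketch. It is not the plugin's code. It assumes a cgroup v1
hierarchy of the form <root>/<controller>/slurm/uid_<uid>/job_<jobid>/,
which may differ on your system, and the function names and the
"slurm.jobs" metric prefix are made up for the example. The output lines
follow Graphite's plaintext format ("path value timestamp").

```python
import os
import re
import time

# Assumption: job cgroups are directories named job_<jobid>.
JOB_DIR = re.compile(r"^job_(\d+)$")

# (controller, file to read, key in the result); cgroup v1 file names.
SOURCES = (
    ("cpuacct", "cpuacct.usage", "cpu_ns"),            # cumulative CPU time, ns
    ("memory", "memory.usage_in_bytes", "mem_bytes"),  # current memory usage
)


def collect_job_metrics(cgroup_root="/sys/fs/cgroup"):
    """Return {job_id: {"cpu_ns": int, "mem_bytes": int}} for visible jobs."""
    metrics = {}
    for controller, filename, key in SOURCES:
        base = os.path.join(cgroup_root, controller, "slurm")
        if not os.path.isdir(base):
            continue  # controller not mounted or no Slurm hierarchy here
        for uid_dir in os.listdir(base):
            uid_path = os.path.join(base, uid_dir)
            if not uid_dir.startswith("uid_") or not os.path.isdir(uid_path):
                continue
            for job_dir in os.listdir(uid_path):
                match = JOB_DIR.match(job_dir)
                if not match:
                    continue
                try:
                    with open(os.path.join(uid_path, job_dir, filename)) as f:
                        value = int(f.read().strip())
                except (OSError, ValueError):
                    continue  # job may have ended between listing and reading
                metrics.setdefault(int(match.group(1)), {})[key] = value
    return metrics


def to_graphite_lines(metrics, prefix="slurm.jobs", now=None):
    """Format metrics as Graphite plaintext lines: 'path value timestamp'."""
    ts = int(time.time() if now is None else now)
    lines = []
    for job_id in sorted(metrics):
        for key in sorted(metrics[job_id]):
            lines.append(f"{prefix}.{job_id}.{key} {metrics[job_id][key]} {ts}")
    return lines


if __name__ == "__main__":
    for line in to_graphite_lines(collect_job_metrics()):
        print(line)
```

Run periodically on each compute node, the printed lines could be sent
to a Graphite carbon listener. cpuacct.usage is a cumulative counter, so
a CPU usage rate has to be derived from successive samples.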

We also wrote a small web interface to visualize the metrics. You can
find out more by searching for 'jobmetrics' in the following slides:


NB: my intent is just to share, not to hijack the thread. Please forgive
me if it comes across the wrong way.


[1] https://collectd.org/
