[slurm-users] Graphing job metrics

Tue Nov 14 05:33:32 MST 2017

Hi there,

On 13/11/2017 at 18:18, Nicholas McCollum wrote:
> Now that there is a slurm-users mailing list, I thought I would share
> something with the community that I have been working on to see if anyone else
> is interested in it.  I have a lot of students on my cluster and I really
> wanted a way to show my users how efficient their jobs are, or let them know
> that they are wasting resources.
> 
> I created a few scripts that leverage Graphite and whisper databases (RRD like)
> to gather metrics from Slurm jobs running in cgroups.  The resolution for the
> metrics is defined by the retention interval that you specify in graphite.  In
> my case I can store 1 minute metrics for CPU usage and Memory usage for the
> entire lifetime of a job.

FWIW, some time ago at EDF we wrote a collectd[1] plugin that does
basically the same thing, i.e. it explores the cgroups to get CPU/memory
metrics out of jobs' processes. The code is here:

   https://github.com/collectd/collectd/pull/1198

With it, you get all of collectd's flexibility in terms of metrics
processing and backends (Graphite, RRD, InfluxDB, and so on).
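
For readers who want a feel for the general approach, here is a minimal
Python sketch of the idea. It is not the code of the collectd plugin. It
assumes cgroup v1 and the usual Slurm hierarchy
(<controller>/slurm/uid_<uid>/job_<jobid>), and the function names and the
"slurm.jobs" metric prefix are made up for illustration. It reads the CPU
and memory counters of each job cgroup and formats them as Graphite
plaintext lines ("path value timestamp"):

```python
import re
import time
from pathlib import Path

# Assumption: cgroup v1 layout as typically created by Slurm's cgroup plugins:
#   <root>/<controller>/slurm/uid_<uid>/job_<jobid>/
JOB_DIR = re.compile(r"uid_(\d+)/job_(\d+)$")


def read_int(path):
    """Return the integer stored in a cgroup file, or None if unreadable
    (for example because the job ended between listing and reading)."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def collect_job_metrics(root="/sys/fs/cgroup"):
    """Map job id -> {metric name: value} by walking the per-job cgroups."""
    # Hypothetical metric names; the files are standard cgroup v1 counters.
    sources = {
        "cpu_ns": ("cpuacct", "cpuacct.usage"),            # cumulative CPU time, ns
        "mem_bytes": ("memory", "memory.usage_in_bytes"),  # current memory usage
    }
    metrics = {}
    for name, (controller, filename) in sources.items():
        base = Path(root) / controller / "slurm"
        for job_dir in base.glob("uid_*/job_*"):
            match = JOB_DIR.search(job_dir.as_posix())
            value = read_int(job_dir / filename)
            if match and value is not None:
                metrics.setdefault(int(match.group(2)), {})[name] = value
    return metrics


def to_graphite_lines(metrics, prefix="slurm.jobs", timestamp=None):
    """Format metrics as Graphite plaintext lines: '<path> <value> <timestamp>'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    lines = []
    for job_id in sorted(metrics):
        for name in sorted(metrics[job_id]):
            lines.append(f"{prefix}.{job_id}.{name} {metrics[job_id][name]} {ts}")
    return lines
```

Calling to_graphite_lines(collect_job_metrics()) on a compute node and
sending the result to a carbon daemon would be one way to use it. Since
cpuacct.usage is a cumulative counter, a CPU usage rate has to be derived
from the difference between two successive samples.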

We also wrote a tiny web interface to visualize the metrics. One can 
find out more by searching 'jobmetrics' in the following slides:

   https://slurm.schedmd.com/SLUG16/EDF.pdf

NB: my intent is just to share, not to hijack the thread. Please forgive
me if it comes across the wrong way.

Best,
Rémi

[1] https://collectd.org/