[slurm-users] Graphing job metrics
remi at rezib.org
Tue Nov 14 05:33:32 MST 2017
On 13/11/2017 at 18:18, Nicholas McCollum wrote:
> Now that there is a slurm-users mailing list, I thought I would share
> something with the community that I have been working on to see if anyone else
> is interested in it. I have a lot of students on my cluster and I really
> wanted a way to show my users how efficient their jobs are, or let them know
> that they are wasting resources.
> I created a few scripts that leverage Graphite and whisper databases (RRD-like)
> to gather metrics from Slurm jobs running in cgroups. The resolution for the
> metrics is defined by the retention interval that you specify in graphite. In
> my case I can store 1-minute metrics for CPU usage and memory usage for the
> entire lifetime of a job.
FWIW, at EDF we wrote a collectd plugin some time ago that does
basically the same thing, i.e. it explores the cgroups to get CPU/memory
metrics out of jobs' processes. Code is here:
You then gain all of collectd's flexibility in terms of metrics processing
and backends (Graphite, RRD, InfluxDB, and so on).
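To give a rough idea of what "exploring the cgroups" means, here is a
minimal Python sketch. It is not the EDF plugin nor Nicholas's scripts.
It assumes a cgroup v1 hierarchy where Slurm creates
slurm/uid_<uid>/job_<jobid> directories under each controller, and it
formats the values as Graphite plaintext lines ("path value timestamp").
The function names and the "slurm.jobs" prefix are made up for the
example.

```python
import re
import time
from pathlib import Path

# Assumed cgroup v1 layout (adjust to your site):
#   <controller root>/slurm/uid_<uid>/job_<jobid>/
JOB_DIR = re.compile(r"job_(\d+)$")


def read_int(path):
    """Return the integer stored in a single-value cgroup file."""
    return int(Path(path).read_text().strip())


def job_metrics(cpu_root, mem_root):
    """Yield (job_id, cpu_ns, mem_bytes) for each job cgroup found."""
    for job_dir in sorted(Path(cpu_root).glob("slurm/uid_*/job_*")):
        match = JOB_DIR.search(job_dir.name)
        if not match:
            continue
        # The memory controller is assumed to mirror the same relative path.
        mem_dir = Path(mem_root) / job_dir.relative_to(cpu_root)
        try:
            cpu_ns = read_int(job_dir / "cpuacct.usage")
            mem_bytes = read_int(mem_dir / "memory.usage_in_bytes")
        except (OSError, ValueError):
            continue  # the job may have ended between listing and reading
        yield match.group(1), cpu_ns, mem_bytes


def graphite_lines(metrics, prefix="slurm.jobs", now=None):
    """Format metrics as Graphite plaintext lines (hypothetical naming)."""
    ts = int(time.time() if now is None else now)
    lines = []
    for job_id, cpu_ns, mem_bytes in metrics:
        lines.append(f"{prefix}.{job_id}.cpu_ns {cpu_ns} {ts}")
        lines.append(f"{prefix}.{job_id}.mem_bytes {mem_bytes} {ts}")
    return lines


if __name__ == "__main__":
    # Typical cgroup v1 mount points; these are assumptions as well.
    found = job_metrics("/sys/fs/cgroup/cpuacct", "/sys/fs/cgroup/memory")
    print("\n".join(graphite_lines(found)))
```

Note that cpuacct.usage is a cumulative counter in nanoseconds, so a
CPU usage rate has to be derived from successive samples, either in
the collector or on the Graphite side. Each line can be written,
followed by a newline, to carbon's plaintext listener (TCP port 2003
by default).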
We also wrote a tiny web interface to visualize the metrics. You can
find out more by searching for 'jobmetrics' in the following slides:
NB: my intent is just to share, not to hijack the thread. Please forgive
me if it comes across the wrong way.