[slurm-users] Graphing job metrics
Rémi Palancher
remi at rezib.org
Tue Nov 14 05:33:32 MST 2017
Hi there,
On 13/11/2017 at 18:18, Nicholas McCollum wrote:
> Now that there is a slurm-users mailing list, I thought I would share
> something with the community that I have been working on to see if anyone else
> is interested in it. I have a lot of students on my cluster and I really
> wanted a way to show my users how efficient their jobs are, or let them know
> that they are wasting resources.
>
> I created a few scripts that leverage Graphite and Whisper databases (RRD-like)
> to gather metrics from Slurm jobs running in cgroups. The resolution for the
> metrics is defined by the retention interval that you specify in graphite. In
> my case I can store 1-minute metrics for CPU usage and memory usage for the
> entire lifetime of a job.
FWIW, some time ago at EDF we wrote a collectd[1] plugin that does
basically the same thing, i.e. it walks the cgroups to collect CPU and
memory metrics for the jobs' processes. The code is here:
https://github.com/collectd/collectd/pull/1198
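To give an idea of what "exploring the cgroups" means in practice, here
is a minimal Python sketch. It is not the plugin's code (the plugin is
written for collectd, see the pull request above). It assumes the
cgroup v1 layout <root>/<controller>/slurm/uid_<uid>/job_<jobid>/, which
may differ on your system, and the function and key names (job_metrics,
cpu_ns, mem_bytes) are made up for illustration.

```python
import os
import re

# ASSUMPTION: cgroup v1 hierarchy laid out as
#   <root>/<controller>/slurm/uid_<uid>/job_<jobid>/
JOB_DIR = re.compile(r"^job_(\d+)$")

# Hypothetical metric names mapped to (controller, single-value file).
FILES = {
    "cpu_ns": ("cpuacct", "cpuacct.usage"),
    "mem_bytes": ("memory", "memory.usage_in_bytes"),
}


def read_int(path):
    """Return the integer stored in a single-value cgroup file."""
    with open(path) as f:
        return int(f.read().strip())


def job_metrics(root="/sys/fs/cgroup"):
    """Return {job_id: {"cpu_ns": ..., "mem_bytes": ...}} for jobs found."""
    metrics = {}
    for name, (controller, filename) in FILES.items():
        base = os.path.join(root, controller, "slurm")
        if not os.path.isdir(base):
            continue  # controller not mounted, or no Slurm hierarchy
        for uid_dir in os.listdir(base):
            uid_path = os.path.join(base, uid_dir)
            if not uid_dir.startswith("uid_") or not os.path.isdir(uid_path):
                continue
            for job_dir in os.listdir(uid_path):
                match = JOB_DIR.match(job_dir)
                if not match:
                    continue
                try:
                    value = read_int(os.path.join(uid_path, job_dir, filename))
                except (OSError, ValueError):
                    continue  # file missing or unreadable, e.g. job just ended
                metrics.setdefault(int(match.group(1)), {})[name] = value
    return metrics
```

Calling job_metrics() periodically and diffing the cpu_ns counter
between two samples gives the CPU time consumed over that interval.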
With that, you get all of collectd's flexibility in terms of metrics
processing and backends (Graphite, RRD, InfluxDB, and so on).
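For the Graphite case, the backend side is simple: Carbon accepts one
"path value timestamp" line per sample over its plaintext protocol
(port 2003 by default). collectd's write_graphite plugin takes care of
this for you; the Python sketch below only illustrates the format. The
helper names and the metric path are hypothetical.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one sample for Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_to_graphite(lines, host="localhost", port=2003):
    """Send preformatted lines to a Carbon plaintext listener.

    ASSUMPTION: Carbon listens on host:port with the plaintext
    protocol enabled (2003 is the usual default).
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical metric path; pick whatever naming scheme suits your site.
sample = graphite_line("slurm.jobs.42.mem_bytes", 4096, 1510662812)
```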
We also wrote a tiny web interface to visualize the metrics. One can
find out more by searching 'jobmetrics' in the following slides:
https://slurm.schedmd.com/SLUG16/EDF.pdf
NB: my intent is just to share, not to hijack the thread. Please forgive
me if it comes across the wrong way.
Best,
Rémi
[1] https://collectd.org/