[slurm-users] Graphing job metrics
minibit at gmail.com
Wed Nov 15 10:13:34 MST 2017
I developed a plugin about 1.5 years ago that uses Slurm's profiling feature
to collect resource usage information and send it to InfluxDB. It is not yet
merged into the official Slurm release, but it may be in the next 18.x
release. If you want to test it, there is a branch in the SchedMD GitHub repo
(https://github.com/SchedMD/slurm/tree/influxdb).
We've had this running since I created it on some mid-sized clusters with
tens of thousands of jobs per day without an issue. We keep a retention
policy of 7 days in InfluxDB to avoid collecting too much data. We then
provide a Grafana dashboard for the users, where they can filter by job ID
to see the CPU and memory usage of their jobs.
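To make the retention and per-job filtering concrete, here is a rough sketch of the InfluxQL involved, built as strings in Python. The database name ("slurm"), the policy name ("seven_days"), the measurement name ("CPUUtilization"), the field name ("value") and the tag names ("job", "host") are all assumptions for illustration; check what the plugin actually writes in your setup before using any of them.

```python
import re

# Conservative patterns so that values can be interpolated into InfluxQL safely.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_DURATION = re.compile(r"^[0-9]+[smhdw]$")


def _check_identifier(name):
    """Reject anything that is not a plain identifier."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def retention_policy_statement(database, policy="seven_days", days=7):
    """Build a CREATE RETENTION POLICY statement keeping `days` days of data.

    The database and policy names are hypothetical placeholders.
    """
    _check_identifier(database)
    _check_identifier(policy)
    days = int(days)
    if days <= 0:
        raise ValueError("days must be positive")
    return (
        f'CREATE RETENTION POLICY "{policy}" ON "{database}" '
        f"DURATION {days}d REPLICATION 1 DEFAULT"
    )


def job_metric_query(measurement, job_id, window="1h", bucket="30s"):
    """Build a query averaging one metric per host for a single job.

    Assumes the metric is stored in a field called "value" and that points
    carry "job" and "host" tags; these names are guesses, not confirmed.
    """
    _check_identifier(measurement)
    job_id = int(job_id)  # job IDs are numeric; this also blocks injection
    for duration in (window, bucket):
        if not _DURATION.match(duration):
            raise ValueError(f"bad duration: {duration!r}")
    # Tag values are strings in InfluxDB, so the job ID is single-quoted.
    return (
        f'SELECT mean("value") FROM "{measurement}" '
        f"WHERE \"job\" = '{job_id}' AND time > now() - {window} "
        f'GROUP BY time({bucket}), "host"'
    )


if __name__ == "__main__":
    print(retention_policy_statement("slurm"))
    print(job_metric_query("CPUUtilization", 1234))
```

The second statement is roughly what a Grafana panel would run, with the job ID supplied by a dashboard variable instead of being hard-coded.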
If you need more details, I'll be glad to answer your questions.
On Tue, Nov 14, 2017 at 6:10 PM, Nicholas McCollum <nmccollum at asc.edu>
wrote:
> I went to the SchedMD booth last night and talked with the guys. Tim told
> me that the Barcelona Supercomputing Center is working on something similar.
> I am going to try to meet with their Slurm person and compare notes.
> I'm also going to look into trying InfluxDB instead of Graphite at the
> recommendation of some people for performance improvements when querying
> hundreds of jobs at the same time.
> If anyone wants a specific time to meet, just e-mail me directly. I will
> be at the SC17 convention center all week.
> Nicholas McCollum
> HPC Systems Administrator
> Alabama Supercomputer Authority
> On Tue, Nov 14, 2017 at 11:12:46AM +0000, Simon Flood wrote:
> > On 14/11/17 10:58, Chris Samuel wrote:
> > > Yup, certainly interest here!
> > Ditto.
> > --
> > Simon Flood
> > HPC System Administrator
> > University of Cambridge Information Services
> > United Kingdom