[slurm-users] Graphing job metrics

Wed Nov 15 11:06:18 MST 2017

I've been tracking down people at SC17 and talking with them about graphing user jobs.  There's a definite consensus that I should be using InfluxDB to store the data.  After SC17 I'm going to rebuild my setup and write a better how-to.

The advantage of my current setup is that the only requirement is running Slurm with cgroups.

The better and more scalable solution is to have it written in C and managed by the slurmd process on the nodes themselves.

I think I may provide a Dockerfile later that will spin everything up automatically.  Then the only requirement is a crontab entry to run a shell script on your nodes to push data to your Docker instance.
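
As a rough illustration of what the cron-driven collector could look like, here is a minimal Python sketch (not the actual script from my setup). It assumes the cgroup v1 memory hierarchy that Slurm creates under /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/, and it prints InfluxDB line protocol to stdout. The measurement name "job_memory" and the function name are made up for this example, and how the output gets pushed to the database is left open.

```python
import re
import socket
import time
from pathlib import Path

# Assumed cgroup v1 layout: <root>/uid_<uid>/job_<jobid>/memory.usage_in_bytes
JOB_DIR = re.compile(r"uid_(\d+)/job_(\d+)$")


def collect_lines(cgroup_root, host, timestamp_ns):
    """Return one InfluxDB line-protocol string per job cgroup found."""
    lines = []
    pattern = "uid_*/job_*/memory.usage_in_bytes"
    for usage_file in sorted(Path(cgroup_root).glob(pattern)):
        match = JOB_DIR.search(usage_file.parent.as_posix())
        if match is None:
            continue
        uid, jobid = match.groups()
        usage = int(usage_file.read_text().strip())
        # "job_memory" is a hypothetical measurement name.
        # The trailing "i" marks the field value as an integer.
        lines.append(
            f"job_memory,host={host},jobid={jobid},uid={uid} "
            f"usage_bytes={usage}i {timestamp_ns}"
        )
    return lines


if __name__ == "__main__":
    root = "/sys/fs/cgroup/memory/slurm"  # assumption: cgroup v1 mount point
    for line in collect_lines(root, socket.gethostname(), time.time_ns()):
        print(line)
```

A crontab entry would run this once a minute or so on each node and hand the output to whatever pushes it to the database.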

Carlos, I'd definitely like to take a look at your setup, especially if you can segregate users so they cannot see another user's job metrics.

Nick McCollum

________________________________
From: Carlos Fenoy <minibit at gmail.com>
Sent: Nov 15, 2017 10:15 AM
To: Slurm User Community List
Subject: Re: [slurm-users] Graphing job metrics

Hi,

I developed a plugin about a year and a half ago that uses Slurm's profiling feature to collect resource usage information and send it to InfluxDB. It is not yet merged into the official Slurm release, but it may be in the next 18.x release. If you want to test it, there is a branch in the SchedMD GitHub repo (https://github.com/SchedMD/slurm/tree/influxdb).

We've had this running since I created it on some mid-sized clusters with tens of thousands of jobs per day without an issue. We have a retention policy of 7 days in InfluxDB to avoid collecting too much data. We then provide a Grafana dashboard for the users, where they can filter by jobid to see the CPU and memory usage of their jobs.
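
To make the retention policy and the per-job filter concrete, here is a minimal Python sketch that only builds the InfluxQL strings (InfluxDB 1.x syntax). The database, policy, and measurement names are hypothetical, and it assumes the job id is stored as a tag called "jobid"; the plugin's real schema may differ.

```python
def retention_policy_statement(database, days=7, name="one_week"):
    """Build an InfluxQL statement that keeps data for `days` days."""
    return (
        f'CREATE RETENTION POLICY "{name}" ON "{database}" '
        f"DURATION {days}d REPLICATION 1 DEFAULT"
    )


def job_query(measurement, jobid):
    """Build an InfluxQL query for one job's points.

    Assumes "jobid" is a tag; tag values are strings, hence the quotes.
    """
    jobid = str(jobid)
    if not jobid.isdigit():
        # Refuse anything that is not a plain numeric job id.
        raise ValueError(f"unexpected job id: {jobid!r}")
    return f"SELECT * FROM \"{measurement}\" WHERE \"jobid\" = '{jobid}'"


if __name__ == "__main__":
    print(retention_policy_statement("slurm"))  # "slurm" is a made-up name
    print(job_query("job_memory", 123456))      # so is "job_memory"
```

In a Grafana dashboard the job id would normally come from a template variable rather than a hard-coded value.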

If you need more details, I'll be glad to answer your questions.

Regards,
Carlos

On Tue, Nov 14, 2017 at 6:10 PM, Nicholas McCollum <nmccollum at asc.edu> wrote:
All,

I went to the SchedMD booth last night and talked with the guys.  Tim told me
that the Barcelona Supercomputing Center is working on something similar.  I am
going to try to meet with their Slurm person and compare notes.

I'm also going to look into trying InfluxDB instead of Graphite at the
recommendation of some people for performance improvements when querying
hundreds of jobs at the same time.

If anyone wants a specific time to meet, just e-mail me directly.  I will be at
the SC17 convention center all week.

---

Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority

On Tue, Nov 14, 2017 at 11:12:46AM +0000, Simon Flood wrote:
> On 14/11/17 10:58, Chris Samuel wrote:
>
> > Yup, certainly interest here!
>
> Ditto.
> --
> Simon Flood
> HPC System Administrator
> University of Cambridge Information Services
> United Kingdom
>

--
Carles Fenoy