[slurm-users] slurm reporting

Renfro, Michael Renfro at tntech.edu
Tue Nov 26 16:39:01 UTC 2019


Once you've added enough to ingest the Slurm logs into Influx or whatever, Grafana could do a similar job. XDMoD already has the pieces in place to dig through your hierarchy of PIs, users, etc., plus some built-in queries, for example correlating job size with wait time:

[Attached image: Wait Hours (Per Job) by Job Size, 2019-10-01 to 2019-10-31, timeseries]

I’ve also started using XDMoD as the data source for some short one-slide presentations, where I extract a graph of historical resource usage and overlay our total job count and total CPU-hours used.

On Nov 26, 2019, at 10:21 AM, Ricardo Gregorio <ricardo.gregorio at rothamsted.ac.uk> wrote:

Mike,

It sounds interesting. In fact, I had come across XDMoD this morning while searching for further info.

Would Grafana do a similar job to XDMoD?



-----Original Message-----
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Renfro, Michael
Sent: 26 November 2019 16:14
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] slurm reporting

• Total number of jobs submitted by user (daily/weekly/monthly)
• Average queue time per user (daily/weekly/monthly)
• Average job run time per user (daily/weekly/monthly)

Open XDMoD for these three: https://github.com/ubccr/xdmod , plus https://xdmod.ccr.buffalo.edu (unfortunately their SSL certificate expired yesterday, so you’ll get a warning).
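
Where XDMoD is not yet installed, the same three numbers can be approximated straight from sacct output. What follows is only a minimal sketch, not how XDMoD computes them: it assumes the text was produced by something like `sacct -a -X -n -P -S <start> -E <end> -o User,Submit,Start,End`, and the names `per_user_summary` and `sample` are made up for illustration. It counts only jobs that have both started and ended, so pending or running jobs are left out of the totals.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Assumed input: '|'-delimited rows of User|Submit|Start|End, as printed by
#   sacct -a -X -n -P -S 2019-10-01 -E 2019-11-01 -o User,Submit,Start,End
# Timestamps are assumed to look like YYYY-MM-DDTHH:MM:SS.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_time(text):
    """Return a datetime, or None for placeholders such as 'Unknown'."""
    try:
        return datetime.strptime(text, TIME_FORMAT)
    except ValueError:
        return None


def per_user_summary(sacct_text):
    """Return {user: (job_count, avg_wait_seconds, avg_run_seconds)}."""
    totals = defaultdict(lambda: [0, 0.0, 0.0])  # jobs, wait seconds, run seconds
    for row in csv.reader(io.StringIO(sacct_text), delimiter="|"):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        user = row[0]
        submit, start, end = (parse_time(value) for value in row[1:])
        if not (user and submit and start and end):
            continue  # skip jobs that have not started or finished yet
        entry = totals[user]
        entry[0] += 1
        entry[1] += (start - submit).total_seconds()  # time spent queued
        entry[2] += (end - start).total_seconds()     # time spent running
    return {u: (n, w / n, r / n) for u, (n, w, r) in totals.items()}


# Made-up sample rows, only to show the shape of the input.
sample = """alice|2019-10-01T08:00:00|2019-10-01T08:10:00|2019-10-01T09:10:00
alice|2019-10-02T08:00:00|2019-10-02T08:30:00|2019-10-02T10:30:00
bob|2019-10-01T12:00:00|2019-10-01T12:00:30|2019-10-01T12:05:30
carol|2019-10-03T09:00:00|Unknown|Unknown
"""

for user, (jobs, wait, run) in sorted(per_user_summary(sample).items()):
    print(f"{user}: {jobs} jobs, avg wait {wait / 60:.1f} min, avg run {run / 60:.1f} min")
```

Daily, weekly, or monthly figures would then come from changing the `-S`/`-E` window on the sacct side rather than from anything in the script.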

• % of time partitions were in use and idle

Not sure how you’d want to define this, plus our partitions have substantial overlap on resources (our partitions are primarily to separate GPU or large memory jobs from others, and to balance priorities and limits on different classes of jobs).
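
One possible definition, offered only as an assumption and not as anything Slurm or XDMoD prescribes: treat "in use" as allocated CPU-seconds inside a reporting window divided by the CPU-seconds the partition could have delivered in that window, and "idle" as the remainder. The sketch below uses hypothetical names (`cpu_utilization`, `jobs`, `total_cpus`) and takes job start/end times and CPU counts as already extracted. With overlapping partitions, the `total_cpus` figure is itself a judgment call, since the same cores would be counted under more than one partition.

```python
from datetime import datetime


def cpu_utilization(jobs, total_cpus, window_start, window_end):
    """Fraction of available CPU-time that was allocated within the window.

    jobs: iterable of (start, end, cpus) tuples, with datetimes for start/end.
    Each job is clipped to the window before its CPU-seconds are counted.
    """
    window_seconds = (window_end - window_start).total_seconds()
    used = 0.0
    for start, end, cpus in jobs:
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        overlap = (clipped_end - clipped_start).total_seconds()
        if overlap > 0:  # jobs entirely outside the window contribute nothing
            used += overlap * cpus
    return used / (total_cpus * window_seconds)


# Made-up example: a 10-CPU partition over one day.
window_start = datetime(2019, 10, 1)
window_end = datetime(2019, 10, 2)
jobs = [
    (datetime(2019, 10, 1, 0), datetime(2019, 10, 1, 12), 10),  # fully inside
    (datetime(2019, 9, 30, 18), datetime(2019, 10, 1, 6), 5),   # starts before the window
    (datetime(2019, 10, 2, 1), datetime(2019, 10, 2, 2), 4),    # entirely after the window
]
utilization = cpu_utilization(jobs, 10, window_start, window_end)
print(f"in use {utilization:.1%}, idle {1 - utilization:.1%}")
```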

• min/max/avg number of nodes/cpus/mem used per user/job

Open XDMoD for CPUs and nodes, and probably Open XDMoD plus SUPREMM for memory (haven’t used this one myself, but I plan to).
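
For the node and CPU part, a rough equivalent can be sketched from sacct as well. This assumes input like `sacct -a -X -n -P -S <start> -E <end> -o User,NNodes,AllocCPUS` with plain integer values in the last two columns; the function name `size_stats` and the sample rows are hypothetical. Memory is deliberately left out, since actual memory use is what SUPREMM is suggested for above.

```python
from collections import defaultdict
from statistics import mean


def size_stats(sacct_text):
    """Return {user: {"nodes": (min, max, avg), "cpus": (min, max, avg)}}.

    Assumed input: '|'-delimited rows of User|NNodes|AllocCPUS, one per job.
    """
    per_user = defaultdict(lambda: {"nodes": [], "cpus": []})
    for line in sacct_text.splitlines():
        fields = line.split("|")
        if len(fields) != 3 or not fields[0]:
            continue  # skip blank lines and rows without a user
        user, nodes, cpus = fields
        per_user[user]["nodes"].append(int(nodes))
        per_user[user]["cpus"].append(int(cpus))
    return {
        user: {key: (min(values), max(values), mean(values)) for key, values in sizes.items()}
        for user, sizes in per_user.items()
    }


# Made-up sample rows, only to show the shape of the input.
sample_sizes = """alice|1|8
alice|4|64
bob|2|16
"""

for user, stats in sorted(size_stats(sample_sizes).items()):
    print(user, stats)
```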

--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601     / Tennessee Tech University

On Nov 26, 2019, at 10:02 AM, Ricardo Gregorio <ricardo.gregorio at rothamsted.ac.uk> wrote:

Hi all,

I am new to both HPC and SLURM.

I have been trying to run some usage reports (using sreport and sacct), but I cannot find a way to get the following info:

• Total number of jobs submitted by user (daily/weekly/monthly)
• Average queue time per user (daily/weekly/monthly)
• Average job run time per user (daily/weekly/monthly)
• % of time partitions were in use and idle
• min/max/avg number of nodes/cpus/mem used per user/job

Is this doable?

Regards,
Ricardo Gregorio
Research and Systems Administrator


Rothamsted Research is a company limited by guarantee, registered in England at Harpenden, Hertfordshire, AL5 2JQ under the registration number 2393175 and a not for profit charity number 802038.



-------------- next part --------------
A non-text attachment was scrubbed...
Name: Wait_Hours__Per_Job__by_Job_Size_2019-10-01_to_2019-10-31_timeseries.png
Type: image/png
Size: 51441 bytes
Desc: Wait_Hours__Per_Job__by_Job_Size_2019-10-01_to_2019-10-31_timeseries.png
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20191126/7020e83d/attachment-0001.png>

