Hi all,
I was wondering if someone can help explain this discrepancy.
I get different values for a project's GPU consumption using sreport vs sacct (plus some calculations).
This is an example that shows this:
sreport -t hours -T gres/gpu cluster AccountUtilizationByuser start=2025-04-01 end=2025-04-05 | grep project1234

gives 178, while

sacct -n -X --allusers --accounts=project1234 --start=2025-04-01 --end=2025-04-05 -o elapsedraw,AllocTRES%80,user,partition

gives

  213480  billing=128,cpu=128,gres/gpu=8,mem=1000G,node=2   gpuplus
  249507  billing=128,cpu=128,gres/gpu=8,mem=1000G,node=2   gpuplus
   13908  billing=64,cpu=64,gres/gpu=4,mem=500G,node=1      gpuplus
    9552  billing=64,cpu=64,gres/gpu=4,mem=500G,node=1      gpuplus
       4  billing=16,cpu=16,gres/gpu=1,mem=200G,node=1      gpu
      11  billing=16,cpu=16,gres/gpu=1,mem=200G,node=1      gpu
  ...
I will not bore you with the full output, but the first job alone consumed 213480 seconds / 3600 * 8 GPUs = 474.4 GPU-hours, which is far more than the 178 hours reported by sreport.
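Roughly, my calculation is just a sum of ElapsedRaw times the allocated GPU count over every job in that output; something like the following (only a sketch, assuming the pipe-separated parsable output):

  sacct -n -X -P --allusers --accounts=project1234 \
        --start=2025-04-01 --end=2025-04-05 \
        -o elapsedraw,alloctres \
  | awk -F'|' '{
        n = 0
        # pull the gres/gpu=N count out of AllocTRES, if present
        if (match($2, /gres\/gpu=[0-9]+/))
            n = substr($2, RSTART + 9, RLENGTH - 9)
        total += $1 * n
    }
    END { printf "%.1f GPU-hours\n", total / 3600 }'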
Any clue why these are inconsistent, or how sreport arrived at the 178 value?
All the best, Passant
Hi Passant,
I've found that when using sacct to track resource usage over specific time periods, it's helpful to include the --truncate option. Without it, jobs that started before the specified start time will have their entire runtime counted, including time outside the specified range. The --truncate option ensures that only the time within the defined period is included. Maybe this can explain some of the discrepancy you experience.
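For example, your query above would become:

  sacct -n -X --allusers --accounts=project1234 \
        --start=2025-04-01 --end=2025-04-05 --truncate \
        -o elapsedraw,AllocTRES%80,user,partition

With --truncate, ElapsedRaw should then only count the part of each job's runtime that falls between --start and --end.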
Best regards, Steen
Hi Steen,
Thanks a lot, that certainly sorted out most of the discrepancies!
I'm still seeing some differences between the sreport and sacct output for certain accounts, though, so I was wondering if there's anything else I'm missing in how sreport calculates it (for sacct I sum CPUTimeRAW and convert it to hours).
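For reference, what I do is roughly the following (only a sketch, assuming parsable output and that --truncate clips CPUTimeRAW the same way it clips Elapsed):

  sacct -n -X -P --allusers --accounts=project1234 \
        --start=2025-04-01 --end=2025-04-05 --truncate \
        -o cputimeraw \
  | awk '{ total += $1 } END { printf "%.1f CPU-hours\n", total / 3600 }'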
All the best, Passant
But I also find things inconsistent with just sreport itself.
I run:
sreport -T Gres/gpu cluster Utilization Start=01/01/25 End=04/30/25
Allocated     Down PLND Dow      Idle  Planned   Reported
--------- -------- -------- --------- -------- ----------
 15310868   451198        0   8607344        0   24369410
Doing the same for each of the first four months in the range above individually gives
Month  Allocated     Down  PLND Dow      Idle  Planned   Reported
-----  ---------  -------  --------  --------  -------  ---------
Jan      3398309   324071         0   2430336        0    6152716
Feb      7712527   448009         0   3717620        0   11878156
Mar      2995147      745         0   3129989        0    6125880
Apr      4371832     2444         0   1582138        0    5956414
and adding those 4 Allocated numbers together gives 18477815 > 15310868
So I assume there is NO truncation going on here: the monthly numbers include the full runtime of any job that ran at any point in that month, including time that falls in the previous or next month.
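Roughly what the per-month runs look like (the exact month-end dates here are an assumption):

  for range in "01/01/25 01/31/25" "02/01/25 02/28/25" \
               "03/01/25 03/31/25" "04/01/25 04/30/25"; do
      set -- $range
      sreport -T Gres/gpu cluster Utilization Start=$1 End=$2
  done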
-- Paul Raines (http://help.nmr.mgh.harvard.edu)