[slurm-users] slurm bank and sreport tres minute usage problem
Miguel Oliveira
miguel.oliveira at uc.pt
Mon Mar 15 13:16:52 UTC 2021
Hi Paul,
Thank you for your reply. Good to know that in your case you get consistent answers; I had done a similar analysis.
Starting with a single user, this is what I got from the accounting records:
sacct -X -u rsantos --starttime=2020-01-01 --endtime=now -o jobid,part,account,start,end,elapsed,alloctres%80 | grep "gres/gpu"
1473 gpu tsrp 2020-12-23T22:37:46 2020-12-23T23:31:22 00:53:36 billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
1488 gpu tsrp 2020-12-23T23:35:58 2020-12-23T23:37:51 00:01:53 billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
1499 gpu tsrp 2020-12-23T23:39:19 2020-12-23T23:44:21 00:05:02 billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
2066 gpu tsrp 2020-12-24T01:32:32 2020-12-25T08:01:43 1-06:29:11 billing=2,cpu=2,energy=16514193,gres/gpu=1,mem=512M,node=1
2993 gpu tsrp 2020-12-29T22:36:13 2020-12-29T22:38:03 00:01:50 billing=8,cpu=8,energy=12032,gres/gpu=1,mem=2G,node=1
To show that this user is the only one in this account with GPU usage, I also ran the query at the account level:
sacct -X -A tsrp -a --starttime=2020-01-01 --endtime=now -o user,jobid,part,account,start,end,elapsed,alloctres%80 | grep "gres/gpu"
rsantos 1473 gpu tsrp 2020-12-23T22:37:46 2020-12-23T23:31:22 00:53:36 billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
rsantos 1488 gpu tsrp 2020-12-23T23:35:58 2020-12-23T23:37:51 00:01:53 billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
rsantos 1499 gpu tsrp 2020-12-23T23:39:19 2020-12-23T23:44:21 00:05:02 billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
rsantos 2066 gpu tsrp 2020-12-24T01:32:32 2020-12-25T08:01:43 1-06:29:11 billing=2,cpu=2,energy=16514193,gres/gpu=1,mem=512M,node=1
rsantos 2993 gpu tsrp 2020-12-29T22:36:13 2020-12-29T22:38:03 00:01:50 billing=8,cpu=8,energy=12032,gres/gpu=1,mem=2G,node=1
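To make the arithmetic explicit, here is a rough sketch of the sum (every job above allocates gres/gpu=1, so GPU-minutes are simply elapsed minutes; the helper is just illustrative):

# Quick check of the GPU-minute total from the listing above.
elapsed = ['00:53:36', '00:01:53', '00:05:02', '1-06:29:11', '00:01:50']

def seconds(t):
    # Slurm elapsed format: [days-]HH:MM:SS
    days, _, hms = t.rpartition('-')
    h, m, s = (int(x) for x in hms.split(':'))
    return int(days or 0) * 86400 + h * 3600 + m * 60 + s

print(sum(seconds(t) for t in elapsed) // 60)   # -> 1891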
This adds up to 1891 GPU-minutes. Querying the QOS limits via scontrol show assoc_mgr, I can confirm this value:
scontrol -o show assoc_mgr | grep ^QOS=tsrp | grep -oP '(?<=GrpTRESMins=).[^ ]*'
cpu=24000000(8769901),mem=N(8687005243),energy=N(0),node=N(201275),billing=N(8769901),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),gres/gpu=N(1891),ic/ofed=N(0)
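For anyone reproducing this, the usage figure can be pulled out of that GrpTRESMins string with something like the sketch below (this is illustrative, not our exact bank code):

import re

# The line printed by scontrol above; each TRES is "name=limit(usage-in-minutes)".
line = ('cpu=24000000(8769901),mem=N(8687005243),energy=N(0),node=N(201275),'
        'billing=N(8769901),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),'
        'gres/gpu=N(1891),ic/ofed=N(0)')

m = re.search(r'gres/gpu=([^(,]+)\((\d+)\)', line)
limit, used = m.group(1), int(m.group(2))
print(limit, used)   # -> N 1891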
If I now use sreport, I get a totally different number:
sreport -t minutes -T gres/gpu -nP cluster AccountUtilizationByUser start=2020-01-01 end=now account=tsrp format=login,used
|62
rsantos|62
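Our bank essentially just runs that command and reads the login|used pairs, roughly like this sketch (function and variable names are illustrative, not our actual code):

import subprocess

def gpu_minutes_used(account, date_start, date_end):
    # Sketch only: run sreport and return {login: minutes} for one account.
    cmd = ('sreport -t minutes -T gres/gpu -nP cluster AccountUtilizationByUser'
           ' start=' + date_start + ' end=' + date_end +
           ' account=' + account + ' format=login,used')
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
    usage = {}
    for line in out.splitlines():
        if '|' not in line:
            continue
        login, used = line.split('|')
        usage[login or '(account total)'] = int(used)
    return usage

# gpu_minutes_used('tsrp', '2020-01-01', 'now') -> {'(account total)': 62, 'rsantos': 62}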
I cannot understand why this is the case, or whether there is some situation in which these different answers both make sense.
Just to rule out a slurm version issue, I even updated to version 20.11.4, to no avail!
Hope someone else can jump in here and give me some pointers!
Best Regards,
MAO
> On 12 Mar 2021, at 19:25, Paul Raines <raines at nmr.mgh.harvard.edu> wrote:
>
>
> Very new to SLURM and have not used sreport before so I decided to
> try your searches myself to see what they do.
>
> I am running 20.11.3 and it seems to match the data for me for a very
> simple case I tested that I could "eyeball".
>
> Looking just at the day 2021-03-09 for user mu40 on account lcn
>
> # sreport -t minutes -T CPU -nP cluster \
> AccountUtilizationByUser start='2021-03-09' end='2021-03-10' \
> account=lcn format=login,used
> |40333
> cx88|33835
> mu40|6498
>
> # sreport -t minutes -T gres/gpu -nP cluster \
> AccountUtilizationByUser start='2021-03-09' end='2021-03-10' \
> account=lcn format=login,used
> |13070
> cx88|9646
> mu40|3425
>
> # sacct --user=mu40 --starttime=2021-03-09 --endtime=2021-03-10 \
> --account=lcn -o jobid,start,end,elapsed,alloctres%80
>
> JobID        Start               End                 Elapsed    AllocTRES
> ------------ ------------------- ------------------- ---------- --------------------------------------------
> 190682       2021-03-05T16:25:55 2021-03-12T09:20:52 6-16:54:57 billing=10,cpu=3,gres/gpu=2,mem=24G,node=1
> 190682.batch 2021-03-05T16:25:55 2021-03-12T09:20:53 6-16:54:58 cpu=3,gres/gpu=2,mem=24G,node=1
> 190682.exte+ 2021-03-05T16:25:55 2021-03-12T09:20:52 6-16:54:57 billing=10,cpu=3,gres/gpu=2,mem=24G,node=1
> 201123       2021-03-09T14:55:20 2021-03-09T14:55:23 00:00:03   billing=9,cpu=4,gres/gpu=1,mem=96G,node=1
> 201123.exte+ 2021-03-09T14:55:20 2021-03-09T14:55:23 00:00:03   billing=9,cpu=4,gres/gpu=1,mem=96G,node=1
> 201123.0     2021-03-09T14:55:20 2021-03-09T14:55:23 00:00:03   cpu=4,gres/gpu=1,mem=96G,node=1
> 201124       2021-03-09T14:55:29 2021-03-10T08:13:07 17:17:38   billing=18,cpu=4,gres/gpu=1,mem=512G,node=1
> 201124.exte+ 2021-03-09T14:55:29 2021-03-10T08:13:07 17:17:38   billing=18,cpu=4,gres/gpu=1,mem=512G,node=1
> 201124.0     2021-03-09T14:55:29 2021-03-10T08:13:07 17:17:38   cpu=4,gres/gpu=1,mem=512G,node=1
>
> So the first job used all 24 hours of that day, the second just 3 seconds
> (so ignore it), and the third about 9 hours and 5 minutes:
>
> CPU = 24*60*3+(9*60+5)*4 = 6500
>
> GPU = 24*60*2+(9*60+5)*1 = 3425
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
> On Thu, 11 Mar 2021 11:03pm, Miguel Oliveira wrote:
>
>> Dear all,
>>
>> Hope you can help me!
>> In our facility we support users via projects that have time allocations. For this we use a simple bank facility that we developed ourselves, along the lines of the old https://jcftang.github.io/slurm-bank/ code.
>> Our implementation differs because we have a QOS per project with a NoDecay flag. The basic commands used are:
>> - scontrol show assoc_mgr to read the limits,
>> - sacctmgr modify qos to modify the limits and,
>> - sreport to read individual usage.
>> We have been using this in production for a while without a single issue for CPU time allocations.
>>
>> Now we need to implement GPU time allocation as well for our new GPU partition.
>> While the first two commands work fine to set or change the limits with gres/gpu, the values we get from sreport do not seem to add up.
>> In this case we use:
>>
>> - command='sreport -t minutes -T gres/gpu -nP cluster AccountUtilizationByUser start='+date_start+' end='+date_end+' account='+account+' format=login,used'
>>
>> We have confirmed via the accounting records that the total reported by scontrol show assoc_mgr is correct, while the value given by sreport is totally off.
>> Did I misunderstand the sreport man page and the command above is reporting something else, or is this a bug?
>> We do something similar with "-T cpu" for the CPU part of the code, and the numbers match up. We are using slurm 20.02.0.
>>
>> Best Regards,
>>
>> MAO
>>
>> ---
>> Miguel Afonso Oliveira
>> Laboratório de Computação Avançada | Laboratory for Advanced Computing
>> Universidade de Coimbra | University of Coimbra
>> T: +351239410681
>> E: miguel.oliveira at uc.pt
>> W: www.uc.pt/lca