[slurm-users] slurm bank and sreport tres minute usage problem
Miguel Oliveira
miguel.oliveira at uc.pt
Mon Mar 15 13:16:52 UTC 2021
Hi Paul,
Thank you for your reply. Good to know that in your case you get consistent answers; I had done a similar analysis.
Starting with a single user, this is what I got from the accounting records:
sacct -X -u rsantos --starttime=2020-01-01 --endtime=now -o jobid,part,account,start,end,elapsed,alloctres%80 | grep "gres/gpu"
1473 gpu tsrp 2020-12-23T22:37:46 2020-12-23T23:31:22 00:53:36 billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
1488 gpu tsrp 2020-12-23T23:35:58 2020-12-23T23:37:51 00:01:53 billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
1499 gpu tsrp 2020-12-23T23:39:19 2020-12-23T23:44:21 00:05:02 billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
2066 gpu tsrp 2020-12-24T01:32:32 2020-12-25T08:01:43 1-06:29:11 billing=2,cpu=2,energy=16514193,gres/gpu=1,mem=512M,node=1
2993 gpu tsrp 2020-12-29T22:36:13 2020-12-29T22:38:03 00:01:50 billing=8,cpu=8,energy=12032,gres/gpu=1,mem=2G,node=1
To show that this user is the only one in this account with GPU usage, I also ran the query at the account level:
sacct -X -A tsrp -a --starttime=2020-01-01 --endtime=now -o user,jobid,part,account,start,end,elapsed,alloctres%80 | grep "gres/gpu"
rsantos 1473 gpu tsrp 2020-12-23T22:37:46 2020-12-23T23:31:22 00:53:36 billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
rsantos 1488 gpu tsrp 2020-12-23T23:35:58 2020-12-23T23:37:51 00:01:53 billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
rsantos 1499 gpu tsrp 2020-12-23T23:39:19 2020-12-23T23:44:21 00:05:02 billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
rsantos 2066 gpu tsrp 2020-12-24T01:32:32 2020-12-25T08:01:43 1-06:29:11 billing=2,cpu=2,energy=16514193,gres/gpu=1,mem=512M,node=1
rsantos 2993 gpu tsrp 2020-12-29T22:36:13 2020-12-29T22:38:03 00:01:50 billing=8,cpu=8,energy=12032,gres/gpu=1,mem=2G,node=1
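To make the arithmetic explicit, here is a rough sketch of the sum (every job above allocates gres/gpu=1, so GPU-minutes are simply elapsed minutes; the helper is just illustrative):

# Quick check of the GPU-minute total from the listing above.
elapsed = ['00:53:36', '00:01:53', '00:05:02', '1-06:29:11', '00:01:50']

def seconds(t):
    # Slurm elapsed format: [days-]HH:MM:SS
    days, _, hms = t.rpartition('-')
    h, m, s = (int(x) for x in hms.split(':'))
    return int(days or 0) * 86400 + h * 3600 + m * 60 + s

print(sum(seconds(t) for t in elapsed) // 60)   # -> 1891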
This adds up to 1891 GPU-minutes. Querying the QOS limits via scontrol show assoc_mgr, I can confirm this value:
scontrol -o show assoc_mgr | grep ^QOS=tsrp | grep -oP '(?<=GrpTRESMins=).[^ ]*'
cpu=24000000(8769901),mem=N(8687005243),energy=N(0),node=N(201275),billing=N(8769901),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),gres/gpu=N(1891),ic/ofed=N(0)
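For anyone reproducing this, the usage figure can be pulled out of that GrpTRESMins string with something like the sketch below (this is illustrative, not our exact bank code):

import re

# The line printed by scontrol above; each TRES is "name=limit(usage-in-minutes)".
line = ('cpu=24000000(8769901),mem=N(8687005243),energy=N(0),node=N(201275),'
        'billing=N(8769901),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),'
        'gres/gpu=N(1891),ic/ofed=N(0)')

m = re.search(r'gres/gpu=([^(,]+)\((\d+)\)', line)
limit, used = m.group(1), int(m.group(2))
print(limit, used)   # -> N 1891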
If I now use sreport, I get a totally different number:
sreport -t minutes -T gres/gpu -nP cluster AccountUtilizationByUser start=2020-01-01 end=now account=tsrp format=login,used
|62
rsantos|62
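Our bank essentially just runs that command and reads the login|used pairs, roughly like this sketch (function and variable names are illustrative, not our actual code):

import subprocess

def gpu_minutes_used(account, date_start, date_end):
    # Sketch only: run sreport and return {login: minutes} for one account.
    cmd = ('sreport -t minutes -T gres/gpu -nP cluster AccountUtilizationByUser'
           ' start=' + date_start + ' end=' + date_end +
           ' account=' + account + ' format=login,used')
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
    usage = {}
    for line in out.splitlines():
        if '|' not in line:
            continue
        login, used = line.split('|')
        usage[login or '(account total)'] = int(used)
    return usage

# gpu_minutes_used('tsrp', '2020-01-01', 'now') -> {'(account total)': 62, 'rsantos': 62}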
I cannot understand why this is the case, or whether there is some situation in which these different answers both make sense.
Just to rule out a slurm version issue, I even updated to version 20.11.4, to no avail!
Hope someone else can jump in here and give me some pointers!
Best Regards,
MAO
> On 12 Mar 2021, at 19:25, Paul Raines <raines at nmr.mgh.harvard.edu> wrote:
>
>
> Very new to SLURM and have not used sreport before so I decided to
> try your searches myself to see what they do.
>
> I am running 20.11.3 and it seems to match the data for me for a very
> simple case I tested that I could "eyeball".
>
> Looking just at the day 2021-03-09 for user mu40 on account lcn
>
> # sreport -t minutes -T CPU -nP cluster \
> AccountUtilizationByUser start='2021-03-09' end='2021-03-10' \
> account=lcn format=login,used
> |40333
> cx88|33835
> mu40|6498
>
> # sreport -t minutes -T gres/gpu -nP cluster \
> AccountUtilizationByUser start='2021-03-09' end='2021-03-10' \
> account=lcn format=login,used
> |13070
> cx88|9646
> mu40|3425
>
> # sacct --user=mu40 --starttime=2021-03-09 --endtime=2021-03-10 \
> --account=lcn -o jobid,start,end,elapsed,alloctres%80
>
> JobID        Start               End                 Elapsed    AllocTRES
> ------------ ------------------- ------------------- ---------- --------------------------------------------
> 190682       2021-03-05T16:25:55 2021-03-12T09:20:52 6-16:54:57 billing=10,cpu=3,gres/gpu=2,mem=24G,node=1
> 190682.batch 2021-03-05T16:25:55 2021-03-12T09:20:53 6-16:54:58 cpu=3,gres/gpu=2,mem=24G,node=1
> 190682.exte+ 2021-03-05T16:25:55 2021-03-12T09:20:52 6-16:54:57 billing=10,cpu=3,gres/gpu=2,mem=24G,node=1
> 201123       2021-03-09T14:55:20 2021-03-09T14:55:23 00:00:03   billing=9,cpu=4,gres/gpu=1,mem=96G,node=1
> 201123.exte+ 2021-03-09T14:55:20 2021-03-09T14:55:23 00:00:03   billing=9,cpu=4,gres/gpu=1,mem=96G,node=1
> 201123.0     2021-03-09T14:55:20 2021-03-09T14:55:23 00:00:03   cpu=4,gres/gpu=1,mem=96G,node=1
> 201124       2021-03-09T14:55:29 2021-03-10T08:13:07 17:17:38   billing=18,cpu=4,gres/gpu=1,mem=512G,node=1
> 201124.exte+ 2021-03-09T14:55:29 2021-03-10T08:13:07 17:17:38   billing=18,cpu=4,gres/gpu=1,mem=512G,node=1
> 201124.0     2021-03-09T14:55:29 2021-03-10T08:13:07 17:17:38   cpu=4,gres/gpu=1,mem=512G,node=1
>
> So the first job used all 24 hours of that day, the second just 3 seconds
> (so ignore it), and the third about 9 hours and 5 minutes:
>
> CPU = 24*60*3+(9*60+5)*4 = 6500
>
> GPU = 24*60*2+(9*60+5)*1 = 3425
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
> On Thu, 11 Mar 2021 11:03pm, Miguel Oliveira wrote:
>
>> Dear all,
>>
>> Hope you can help me!
>> In our facility we support users via projects that have time allocations. For this we use a simple bank facility that we developed ourselves, along the lines of the old https://jcftang.github.io/slurm-bank/ code.
>> Our implementation differs because we have a QOS per project with a NoDecay flag. The basic commands used are:
>> - scontrol show assoc_mgr to read the limits,
>> - sacctmgr modify qos to modify the limits and,
>> - sreport to read individual usage.
>> We have been using this in production for a while without a single issue for CPU time allocations.
>>
>> Now we need to implement GPU time allocation as well for our new GPU partition.
>> While the first two commands work fine to set or change the limits with gres/gpu, the values we get from sreport do not seem to add up.
>> In this case we use:
>>
>> - command='sreport -t minutes -T gres/gpu -nP cluster AccountUtilizationByUser start='+date_start+' end='+date_end+' account='+account+' format=login,used'
>>
>> We have confirmed via the accounting records that the total reported by scontrol show assoc_mgr is correct, while the value given by sreport is totally off.
>> Did I misunderstand the sreport man page and the command above is reporting something else, or is this a bug?
>> We do something similar with "-T cpu" for the CPU part of the code, and the numbers match up. We are using slurm 20.02.0.
>>
>> Best Regards,
>>
>> MAO
>>
>> ---
>> Miguel Afonso Oliveira
>> Laboratório de Computação Avançada | Laboratory for Advanced Computing
>> Universidade de Coimbra | University of Coimbra
>> T: +351239410681
>> E: miguel.oliveira at uc.pt
>> W: www.uc.pt/lca