Just following up on my own message in case someone else is trying to figure out RawUsage and Fair Share. 

I ran some additional tests, this time with 10 min jobs instead of 1 min jobs. The procedure was:

1.  Set the accounting stats to update every minute in slurm.conf

PriorityCalcPeriod=1

2. Reset the RawUsage stat

sacctmgr modify account luchko_group set RawUsage=0

3. Check the RawUsage every second

 while sleep 1; do date; sshare -ao Account,User,RawShares,NormShares,RawUsage ; done > watch.out

4. Run a 10 min job.  The billing per CPU is 1, so the total RawUsage should be 60,000 (100 CPUs x 600 s) and the RawUsage should increase by 6,000 each minute

sbatch --account=luchko_group --wrap="sleep 600" -p cpu -n 100
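
For anyone checking the arithmetic in step 4, here is a minimal sketch. It assumes RawUsage accumulates as billing weight x CPUs x seconds; the function name is made up for illustration and is not part of Slurm.

```python
# Expected RawUsage for the test job, ASSUMING usage = weight x CPUs x seconds.
cpus = 100             # from "-n 100"
billing_per_cpu = 1.0  # CPU=1.0 in TRESBillingWeights
runtime_s = 600        # from "sleep 600"


def expected_usage(cpus, weight, seconds):
    """Hypothetical helper: billed CPU-seconds for a job (illustration only)."""
    return cpus * weight * seconds


total = expected_usage(cpus, billing_per_cpu, runtime_s)
per_minute = expected_usage(cpus, billing_per_cpu, 60)
print(total, per_minute)
```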

Scanning the output file, I can see that the RawUsage does update once every minute.  Below are the updates. (I've removed irrelevant output.)

Tue Sep 24 10:14:24 AM PDT 2024
Account                    User  RawShares  NormShares    RawUsage
-------------------- ---------- ---------- ----------- -----------
  luchko_group          tluchko        100    0.500000           0
Tue Sep 24 10:14:25 AM PDT 2024  
  luchko_group          tluchko        100    0.500000        4099
Tue Sep 24 10:15:24 AM PDT 2024
  luchko_group          tluchko        100    0.500000       10099
Tue Sep 24 10:16:25 AM PDT 2024
  luchko_group          tluchko        100    0.500000       16099
Tue Sep 24 10:17:24 AM PDT 2024
  luchko_group          tluchko        100    0.500000       22098
Tue Sep 24 10:18:25 AM PDT 2024
  luchko_group          tluchko        100    0.500000       28097
Tue Sep 24 10:19:24 AM PDT 2024
  luchko_group          tluchko        100    0.500000       34096
Tue Sep 24 10:20:25 AM PDT 2024
  luchko_group          tluchko        100    0.500000       40094
Tue Sep 24 10:21:24 AM PDT 2024
  luchko_group          tluchko        100    0.500000       46093
Tue Sep 24 10:22:25 AM PDT 2024
  luchko_group          tluchko        100    0.500000       52091
Tue Sep 24 10:23:24 AM PDT 2024
  luchko_group          tluchko        100    0.500000       58089
Tue Sep 24 10:24:25 AM PDT 2024
 luchko_group                         2000    0.133324       58087
Tue Sep 24 10:25:25 AM PDT 2024
  luchko_group          tluchko        100    0.500000       58085
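
A condensed listing like the one above can be pulled out of watch.out with a small filter. This is only a sketch: the function name is hypothetical, and it assumes each `date` line starts with an English weekday abbreviation and that the user row has the account and user as its first two fields, as in the output shown. The sample lines are copied from the listing above.

```python
WEEKDAYS = {"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"}


def usage_changes(lines, account, user):
    """Yield (timestamp, raw_usage) each time the user's RawUsage changes."""
    timestamp = None
    last = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] in WEEKDAYS:
            # Output of `date`: remember it for the sshare rows that follow.
            timestamp = line.strip()
        elif fields[0] == account and len(fields) >= 2 and fields[1] == user:
            # User row: RawUsage is the last column requested from sshare.
            usage = int(fields[-1])
            if usage != last:
                yield timestamp, usage
                last = usage


sample = """\
Tue Sep 24 10:14:24 AM PDT 2024
Account                    User  RawShares  NormShares    RawUsage
-------------------- ---------- ---------- ----------- -----------
  luchko_group          tluchko        100    0.500000           0
Tue Sep 24 10:14:25 AM PDT 2024
  luchko_group          tluchko        100    0.500000        4099
Tue Sep 24 10:15:24 AM PDT 2024
  luchko_group          tluchko        100    0.500000       10099
"""

changes = list(usage_changes(sample.splitlines(), "luchko_group", "tluchko"))
for stamp, usage in changes:
    print(stamp, usage)
```

Passing an open file handle for watch.out in place of `sample.splitlines()` should work the same way, since the function only iterates over lines.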

So, the RawUsage does increase by the expected amount each minute, and the RawUsage does decay (I have the half-life set to 14 days).  However, the update for the final partial minute, which should be 1901 (60,000 - 4,099 - 9 x 6,000), is not recorded.  I suspect this is because the job is no longer running when the accounting update occurs.
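
The size of the decay can be sanity-checked with a rough sketch. It assumes plain exponential decay with a 14 day half-life; Slurm's internal calculation may differ in detail, and the function name is made up for illustration.

```python
HALF_LIFE_S = 14 * 24 * 3600  # 14 day half-life, in seconds


def decayed(usage, elapsed_s, half_life_s=HALF_LIFE_S):
    """Usage remaining after elapsed_s seconds of ASSUMED exponential decay."""
    return usage * 0.5 ** (elapsed_s / half_life_s)


# Decay over one minute, starting from the last value recorded while the job ran.
loss = 58089 - decayed(58089, 60)
print(round(loss, 2))
```

Under that assumption the loss comes out at about 2 per minute, which is in line with the 58089 -> 58087 -> 58085 steps seen after the job finished.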

For typical jobs that run for hours or days, this is a negligible error, but it does explain the results I got when I ran a 1 min job.
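
To make that suspicion concrete, here is a toy model of it. This is only my hypothesis written as code, not confirmed slurmctld behaviour: usage is counted only at accounting updates that happen while the job is still running, so anything after the last such update is lost. The function and parameter names are hypothetical, decay is ignored, and the 300 s period used for the 1 min case is an assumption (the default PriorityCalcPeriod of 5 minutes).

```python
def recorded_usage(duration_s, first_tick_s, period_s, rate_per_s):
    """Usage recorded if only updates that fire while the job runs are counted.

    first_tick_s is the time from job start to the first accounting update.
    """
    if first_tick_s > duration_s:
        return 0.0
    later_ticks = int((duration_s - first_tick_s) // period_s)
    last_tick_s = first_tick_s + later_ticks * period_s
    return last_tick_s * rate_per_s


# 10 min job on 100 CPUs; first update 40.99 s after start (4099 / 100).
ten_min = recorded_usage(600, 40.99, 60, 100)
print(round(ten_min), round(60000 - ten_min))

# 1 min job; an update landing 31 s into the job, ASSUMED 300 s period.
one_min = recorded_usage(60, 31, 300, 100)
print(round(one_min))
```

With these inputs the model gives 58099 recorded and 1901 missing for the 10 min job, and 3100 for a 1 min job, which is the kind of partial value I saw earlier.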

TRESRunMins is still not updating, but this is only an inconvenience.

Tyler

Sent with Proton Mail secure email.

On Thursday, September 19th, 2024 at 8:47 PM, tluchko via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hello,

I'm hoping someone can offer some suggestions.

I went ahead and restarted the database from scratch, reinitializing it to see if that would help and to try to understand how RawUsage is calculated.  I ran two jobs of

sbatch --account=luchko_group --wrap="sleep 60" -p cpu -n 100

With the partition defined as 

PriorityFlags=MAX_TRES
PartitionName=cpu Nodes=node[1-7] MaxCPUsPerNode=182 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"

I expected each job to contribute 6000 to the RawUsage (100 CPUs x 60 s). However, one job contributed 3100 and the other 2800, and TRESRunMins stayed at 0 for all categories.

I'm at a loss as to what is going on.

Thank you,

Tyler


On Tuesday, September 10th, 2024 at 9:03 PM, tluchko <tluchko@protonmail.com> wrote:
Hello,

We have a new cluster and I'm trying to set up fairshare accounting.  I'm trying to track CPU, MEM and GPU.  It seems that billing for individual jobs is correct, but billing isn't being accumulated (TRESRunMins is always 0).

In my slurm.conf, I think the relevant lines are

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=gres/gpu

PriorityFlags=MAX_TRES

PartitionName=gpu Nodes=node[1-7] MaxCPUsPerNode=384 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"
PartitionName=cpu Nodes=node[1-7] MaxCPUsPerNode=182 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"

I currently have one recently finished job and one running job.  sacct gives

$ sacct --format=JobID,JobName,ReqTRES%50,AllocTRES%50,TRESUsageInAve%50,TRESUsageInMax%50
JobID           JobName                                            ReqTRES                                          AllocTRES                                     TRESUsageInAve                                     TRESUsageInMax
------------ ---------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- --------------------------------------------------
154          interacti+           billing=9,cpu=1,gres/gpu=1,mem=1G,node=1           billing=9,cpu=2,gres/gpu=1,mem=2G,node=1
154.interac+ interacti+                                                                        cpu=2,gres/gpu=1,mem=2G,node=1 cpu=00:00:00,energy=0,fs/disk=2480503,mem=3M,page+ cpu=00:00:00,energy=0,fs/disk=2480503,mem=3M,page+
155          interacti+           billing=9,cpu=1,gres/gpu=1,mem=1G,node=1           billing=9,cpu=2,gres/gpu=1,mem=2G,node=1
155.interac+ interacti+                                                                        cpu=2,gres/gpu=1,mem=2G,node=1
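
The billing value in the sacct output above can be reproduced with a minimal sketch. It assumes that with PriorityFlags=MAX_TRES a single-node job is billed for its largest weighted TRES, and that the result is truncated to an integer (inferred from sacct showing 9 rather than 9.6). The function name and dictionary keys are made up for illustration.

```python
import math


def max_tres_billing(alloc, weights):
    """Largest weighted TRES, truncated to an integer (ASSUMED behaviour)."""
    weighted = [alloc[name] * weights.get(name, 0.0) for name in alloc]
    return math.floor(max(weighted))


# Weights from TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"
weights = {"cpu": 1.0, "mem_gb": 0.125, "gres/gpu": 9.6}

# AllocTRES of job 154: cpu=2, mem=2G, gres/gpu=1
alloc = {"cpu": 2, "mem_gb": 2, "gres/gpu": 1}

print(max_tres_billing(alloc, weights))  # -> 9
```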

billing=9 seems correct to me, since I have 1 GPU allocated, which has the largest weighted score of 9.6.  However, sshare shows only zeros in TRESRunMins:

sshare --format=Account,User,RawShares,FairShare,RawUsage,EffectvUsage,TRESRunMins%110
Account                    User  RawShares  FairShare    RawUsage  EffectvUsage                                                                                                    TRESRunMins
-------------------- ---------- ---------- ---------- ----------- ------------- --------------------------------------------------------------------------------------------------------------
root                                                     21589714      1.000000         cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
 abrol_group                          2000                      0      0.000000         cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
 luchko_group                         2000               21589714      1.000000         cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
  luchko_group          tluchko          1   0.333333    21589714      1.000000         cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0

Why is TRESRunMins all 0 for tluchko when RawUsage is not? I have checked and slurmdbd is running.

Thank you,

Tyler