Just following up on my own message in case someone else is trying to figure out RawUsage and Fair Share.
I ran some additional tests, this time running jobs for 10 min instead of 1 min. The procedure was:
1. Set the accounting stats to update every minute in slurm.conf
PriorityCalcPeriod=1
2. Reset the RawUsage stat
sacctmgr modify account luchko_group set RawUsage=0
3. Check the RawUsage every second
while sleep 1; do date; sshare -ao Account,User,RawShares,NormShares,RawUsage ; done > watch.out
4. Run a 10 min job. The billing per CPU is 1, so the total RawUsage should be 60,000 (100 CPUs × 600 s) and the RawUsage should increase by 6,000 each minute
sbatch --account=luchko_group --wrap="sleep 600" -p cpu -n 100
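To pick the updates out of watch.out, a rough awk filter along these lines prints only the seconds where RawUsage changed; it is a hypothetical helper, and the pattern and column choices assume the exact layout produced by step 3:
awk '/^[A-Z][a-z][a-z] [A-Z][a-z][a-z] / { ts = $0; next }   # remember the most recent date line
     $2 == "tluchko" && $NF != prev     { print ts; print; prev = $NF }' watch.out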
Scanning the output file, I can see that the RawUsage does update once every minute. Below are the updates. (I've removed irrelevant output.)
Tue Sep 24 10:14:24 AM PDT 2024
Account User RawShares NormShares RawUsage
-------------------- ---------- ---------- ----------- -----------
luchko_group tluchko 100 0.500000 0
Tue Sep 24 10:14:25 AM PDT 2024
luchko_group tluchko 100 0.500000 4099
Tue Sep 24 10:15:24 AM PDT 2024
luchko_group tluchko 100 0.500000 10099
Tue Sep 24 10:16:25 AM PDT 2024
luchko_group tluchko 100 0.500000 16099
Tue Sep 24 10:17:24 AM PDT 2024
luchko_group tluchko 100 0.500000 22098
Tue Sep 24 10:18:25 AM PDT 2024
luchko_group tluchko 100 0.500000 28097
Tue Sep 24 10:19:24 AM PDT 2024
luchko_group tluchko 100 0.500000 34096
Tue Sep 24 10:20:25 AM PDT 2024
luchko_group tluchko 100 0.500000 40094
Tue Sep 24 10:21:24 AM PDT 2024
luchko_group tluchko 100 0.500000 46093
Tue Sep 24 10:22:25 AM PDT 2024
luchko_group tluchko 100 0.500000 52091
Tue Sep 24 10:23:24 AM PDT 2024
luchko_group tluchko 100 0.500000 58089
Tue Sep 24 10:24:25 AM PDT 2024
luchko_group 2000 0.133324 58087
Tue Sep 24 10:25:25 AM PDT 2024
luchko_group tluchko 100 0.500000 58085
So, the RawUsage does increase by the expected amount each minute, and it does decay (I have PriorityDecayHalfLife set to 14 days). However, the update for the final partial minute, which should be 1901, is never recorded. I suspect this is because the job is no longer running when the accounting update occurs.
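As a sanity check on the decay: the roughly 2/minute drop after the job finished is about what a 14-day half-life predicts. A quick awk sketch of the arithmetic; the 2^(-t/half_life) form is my reading of how the decay is applied each PriorityCalcPeriod, not something taken from the Slurm source:
awk 'BEGIN {
    half_life_min = 14 * 24 * 60               # 14 days in minutes
    factor = 2 ^ (-1 / half_life_min)          # per-minute decay multiplier
    printf "loss per minute on 58089: %.1f\n", 58089 * (1 - factor)   # ~2.0
}'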
For typical jobs that run for hours or days, this is a negligible error, but it does explain the results I got when I ran a 1 min job.
TRESRunMins is still not updating, but that is only an inconvenience.
Tyler
On Thursday, September 19th, 2024 at 8:47 PM, tluchko via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hello,
I'm hoping someone can offer some suggestions.
I went ahead and started the database from scratch, reinitializing it to see if that would help and to try to understand how RawUsage is calculated. I ran two jobs of
sbatch --account=luchko_group --wrap="sleep 60" -p cpu -n 100
With the partition defined as
PriorityFlags=MAX_TRES
PartitionName=cpu Nodes=node[1-7] MaxCPUsPerNode=182 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"
I expected each job to contribute 6000 to the RawUsage; however, one job contributed 3100 and the other 2800. And TRESRunMins stayed at 0 for all categories.
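For reference, the 6000 expectation is just the job's billing (100 CPUs at weight 1.0) multiplied by the 60 s runtime:
echo $(( 100 * 1 * 60 ))   # billing x seconds = 6000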
I'm at a loss as to what is going on.
Thank you,
Tyler
On Tuesday, September 10th, 2024 at 9:03 PM, tluchko <tluchko@protonmail.com> wrote:
Hello,
We have a new cluster and I'm trying to set up fairshare accounting. I'm trying to track CPU, MEM, and GPU. It seems that billing for individual jobs is correct, but billing isn't being accumulated (TRESRunMins is always 0).
In my slurm.conf, I think the relevant lines are
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=gres/gpu
PriorityFlags=MAX_TRES
PartitionName=gpu Nodes=node[1-7] MaxCPUsPerNode=384 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"
PartitionName=cpu Nodes=node[1-7] MaxCPUsPerNode=182 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"
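With PriorityFlags=MAX_TRES, my understanding is that a job's billing is the largest of its weighted TRES terms. A rough awk sketch of that arithmetic, using the weights above and the 2-CPU/2G/1-GPU allocation from the sacct output below:
awk 'BEGIN {
    cpu = 2 * 1.0      # CPUs x CPU weight
    mem = 2 * 0.125    # GB x MEM weight (0.125 per GB)
    gpu = 1 * 9.6      # GPUs x GRES/gpu weight
    b = cpu > mem ? cpu : mem
    b = b > gpu ? b : gpu
    print "billing =", b    # 9.6, reported as billing=9
}'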
I currently have one recently finished job and one running job. sacct gives
$ sacct --format=JobID,JobName,ReqTRES%50,AllocTRES%50,TRESUsageInAve%50,TRESUsageInMax%50
JobID JobName ReqTRES AllocTRES TRESUsageInAve TRESUsageInMax
------------ ---------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- --------------------------------------------------
154 interacti+ billing=9,cpu=1,gres/gpu=1,mem=1G,node=1 billing=9,cpu=2,gres/gpu=1,mem=2G,node=1
154.interac+ interacti+ cpu=2,gres/gpu=1,mem=2G,node=1 cpu=00:00:00,energy=0,fs/disk=2480503,mem=3M,page+ cpu=00:00:00,energy=0,fs/disk=2480503,mem=3M,page+
155 interacti+ billing=9,cpu=1,gres/gpu=1,mem=1G,node=1 billing=9,cpu=2,gres/gpu=1,mem=2G,node=1
155.interac+ interacti+ cpu=2,gres/gpu=1,mem=2G,node=1
billing=9 seems correct to me, since I have 1 GPU allocated, which has the largest weighted score (9.6). However, sshare doesn't show anything in TRESRunMins:
sshare --format=Account,User,RawShares,FairShare,RawUsage,EffectvUsage,TRESRunMins%110
Account User RawShares FairShare RawUsage EffectvUsage TRESRunMins
-------------------- ---------- ---------- ---------- ----------- ------------- --------------------------------------------------------------------------------------------------------------
root 21589714 1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
abrol_group 2000 0 0.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
luchko_group 2000 21589714 1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
luchko_group tluchko 1 0.333333 21589714 1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
Why is TRESRunMins all 0 for tluchko, but RawUsage is not? I have checked that slurmdbd is running.
Thank you,
Tyler