Just following up on my own message in case someone else is trying to figure out RawUsage and Fair Share.
I ran some additional tests, this time running jobs for 10 min instead of 1 min. The procedure was:
1. Set the accounting stats to update every minute in slurm.conf
PriorityCalcPeriod=1
2. Reset the RawUsage stat
sacctmgr modify account luchko_group set RawUsage=0
3. Check the RawUsage every second
while sleep 1; do date; sshare -ao Account,User,RawShares,NormShares,RawUsage ; done > watch.out
4. Run a 10 min job. The billing per CPU is 1, so the total RawUsage should be 60,000 (100 CPUs × 600 s) and the RawUsage should increase by 6,000 each minute
sbatch --account=luchko_group --wrap="sleep 600" -p cpu -n 100
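To pick the updates out of watch.out, a rough awk filter along these lines prints only the seconds where RawUsage changed; it is a hypothetical helper, and the pattern and column choices assume the exact layout produced by step 3:
awk '/^[A-Z][a-z][a-z] [A-Z][a-z][a-z] / { ts = $0; next }   # remember the most recent date line
     $2 == "tluchko" && $NF != prev     { print ts; print; prev = $NF }' watch.out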
Scanning the output file, I can see that the RawUsage does update once every minute. Below are the updates. (I've removed irrelevant output.)
Tue Sep 24 10:14:24 AM PDT 2024
Account User RawShares NormShares RawUsage
-------------------- ---------- ---------- ----------- -----------
luchko_group tluchko 100 0.500000 0
Tue Sep 24 10:14:25 AM PDT 2024
luchko_group tluchko 100 0.500000 4099
Tue Sep 24 10:15:24 AM PDT 2024
luchko_group tluchko 100 0.500000 10099
Tue Sep 24 10:16:25 AM PDT 2024
luchko_group tluchko 100 0.500000 16099
Tue Sep 24 10:17:24 AM PDT 2024
luchko_group tluchko 100 0.500000 22098
Tue Sep 24 10:18:25 AM PDT 2024
luchko_group tluchko 100 0.500000 28097
Tue Sep 24 10:19:24 AM PDT 2024
luchko_group tluchko 100 0.500000 34096
Tue Sep 24 10:20:25 AM PDT 2024
luchko_group tluchko 100 0.500000 40094
Tue Sep 24 10:21:24 AM PDT 2024
luchko_group tluchko 100 0.500000 46093
Tue Sep 24 10:22:25 AM PDT 2024
luchko_group tluchko 100 0.500000 52091
Tue Sep 24 10:23:24 AM PDT 2024
luchko_group tluchko 100 0.500000 58089
Tue Sep 24 10:24:25 AM PDT 2024
luchko_group 2000 0.133324 58087
Tue Sep 24 10:25:25 AM PDT 2024
luchko_group tluchko 100 0.500000 58085
So, the RawUsage does increase by the expected amount each minute, and it does decay (I have PriorityDecayHalfLife set to 14 days). However, the update for the final partial minute, which should be 1901, is never recorded. I suspect this is because the job is no longer running when the accounting update occurs.
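As a sanity check on the decay: the roughly 2/minute drop after the job finished is about what a 14-day half-life predicts. A quick awk sketch of the arithmetic; the 2^(-t/half_life) form is my reading of how the decay is applied each PriorityCalcPeriod, not something taken from the Slurm source:
awk 'BEGIN {
    half_life_min = 14 * 24 * 60               # 14 days in minutes
    factor = 2 ^ (-1 / half_life_min)          # per-minute decay multiplier
    printf "loss per minute on 58089: %.1f\n", 58089 * (1 - factor)   # ~2.0
}'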
For typical jobs that run for hours or days, this is a negligible error, but it does explain the results I got when I ran a 1 min job.
TRESRunMins is still not updating, but that is only an inconvenience.
Tyler
On Thursday, September 19th, 2024 at 8:47 PM, tluchko via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hello,
I'm hoping someone can offer some suggestions.
I went ahead and started the database from scratch, reinitializing it to see if that would help and to try to understand how RawUsage is calculated. I ran two jobs of
sbatch --account=luchko_group --wrap="sleep 60" -p cpu -n 100
With the partition defined as
PriorityFlags=MAX_TRES
PartitionName=cpu Nodes=node[1-7] MaxCPUsPerNode=182 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"
I expected each job to contribute 6000 to the RawUsage; however, one job contributed 3100 and the other 2800. And TRESRunMins stayed at 0 for all categories.
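For reference, the 6000 expectation is just the job's billing (100 CPUs at weight 1.0) multiplied by the 60 s runtime:
echo $(( 100 * 1 * 60 ))   # billing x seconds = 6000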
I'm at a loss as to what is going on.
Thank you,
Tyler
On Tuesday, September 10th, 2024 at 9:03 PM, tluchko <tluchko@protonmail.com> wrote:
Hello,
We have a new cluster and I'm trying to set up fairshare accounting. I'm trying to track CPU, MEM, and GPU. It seems that billing for individual jobs is correct, but billing isn't being accumulated (TRESRunMins is always 0).
In my slurm.conf, I think the relevant lines are
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=gres/gpu
PriorityFlags=MAX_TRES
PartitionName=gpu Nodes=node[1-7] MaxCPUsPerNode=384 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"
PartitionName=cpu Nodes=node[1-7] MaxCPUsPerNode=182 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"
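With PriorityFlags=MAX_TRES, my understanding is that a job's billing is the largest of its weighted TRES terms. A rough awk sketch of that arithmetic, using the weights above and the 2-CPU/2G/1-GPU allocation from the sacct output below:
awk 'BEGIN {
    cpu = 2 * 1.0      # CPUs x CPU weight
    mem = 2 * 0.125    # GB x MEM weight (0.125 per GB)
    gpu = 1 * 9.6      # GPUs x GRES/gpu weight
    b = cpu > mem ? cpu : mem
    b = b > gpu ? b : gpu
    print "billing =", b    # 9.6, reported as billing=9
}'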
I currently have one recently finished job and one running job. sacct gives
$ sacct --format=JobID,JobName,ReqTRES%50,AllocTRES%50,TRESUsageInAve%50,TRESUsageInMax%50
JobID JobName ReqTRES AllocTRES TRESUsageInAve TRESUsageInMax
------------ ---------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- --------------------------------------------------
154 interacti+ billing=9,cpu=1,gres/gpu=1,mem=1G,node=1 billing=9,cpu=2,gres/gpu=1,mem=2G,node=1
154.interac+ interacti+ cpu=2,gres/gpu=1,mem=2G,node=1 cpu=00:00:00,energy=0,fs/disk=2480503,mem=3M,page+ cpu=00:00:00,energy=0,fs/disk=2480503,mem=3M,page+
155 interacti+ billing=9,cpu=1,gres/gpu=1,mem=1G,node=1 billing=9,cpu=2,gres/gpu=1,mem=2G,node=1
155.interac+ interacti+ cpu=2,gres/gpu=1,mem=2G,node=1
billing=9 seems correct to me, since I have 1 GPU allocated, which has the largest weighted score (9.6). However, sshare doesn't show anything in TRESRunMins:
sshare --format=Account,User,RawShares,FairShare,RawUsage,EffectvUsage,TRESRunMins%110
Account User RawShares FairShare RawUsage EffectvUsage TRESRunMins
-------------------- ---------- ---------- ---------- ----------- ------------- --------------------------------------------------------------------------------------------------------------
root 21589714 1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
abrol_group 2000 0 0.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
luchko_group 2000 21589714 1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
luchko_group tluchko 1 0.333333 21589714 1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
Why is TRESRunMins all 0 for tluchko, but RawUsage is not? I have checked that slurmdbd is running.
Thank you,
Tyler