Hi guys,
We've just set up our new cluster and are facing some issues regarding fairshare calculation. Our slurm.conf directives for priority calculation are defined as follows:
PriorityType=priority/multifactor
PriorityFlags=MAX_TRES
PriorityDecayHalfLife=14-0
PriorityFavorSmall=NO
PriorityMaxAge=14-0
PriorityWeightAge=1000
PriorityWeightJobSize=1000
PriorityWeightPartition=10000000
PriorityWeightQOS=10000000
PriorityWeightTRES=CPU=2000,Mem=4000
PriorityWeightFairshare=100000
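In case it helps, this is how we understand the multifactor plugin combines these weights (a rough Python sketch based on our reading of the priority/multifactor documentation; all the factor values below are made-up placeholders, since slurmctld normalizes the real ones to [0, 1] internally):

weights = {
    "age": 1000,
    "job_size": 1000,
    "partition": 10_000_000,
    "qos": 10_000_000,
    "fairshare": 100_000,
}
tres_weights = {"cpu": 2000, "mem": 4000}

# Placeholder factors in [0, 1]; fairshare is what we actually see.
factors = {"age": 0.5, "job_size": 0.1, "partition": 1.0,
           "qos": 0.2, "fairshare": 0.0}
tres_factors = {"cpu": 0.3, "mem": 0.6}

priority = (sum(weights[k] * factors[k] for k in weights)
            + sum(tres_weights[t] * tres_factors[t] for t in tres_factors))
print(int(priority))

With the fairshare factor stuck at 0, the whole PriorityWeightFairshare=100000 term drops out of every job's priority, which is why this matters to us.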
The partition we are submitting our jobs to is set up as follows:
PartitionName=mypart Priority=1000 TRESBillingWeights="CPU=1.0,Mem=0.25G" Default=YES MaxTime=96:0:0 DefMemPerCPU=5333 Nodes=node[001-036] MaxNodes=20
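For what it's worth, our understanding of how TRESBillingWeights interacts with PriorityFlags=MAX_TRES (billing becomes the maximum of the weighted per-node TRES rather than their sum) is sketched below; the job shape is made up:

# Hypothetical job: 8 CPUs with the partition's DefMemPerCPU of 5333 MB.
cpus = 8
mem_gb = cpus * 5333 / 1024

weighted = {
    "cpu": 1.0 * cpus,      # CPU=1.0
    "mem": 0.25 * mem_gb,   # Mem=0.25G
}
billing_max = max(weighted.values())  # PriorityFlags=MAX_TRES, as we read the docs
billing_sum = sum(weighted.values())  # default behaviour without the flag
print(billing_max, billing_sum)       # ~10.4 vs ~18.4 for this job

Note that 0.25 * 5333/1024 is about 1.3 per CPU, so with the default memory per CPU the memory term slightly outweighs the CPU term in the billing.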
Whenever we take a look at the fairshare values using sshare -l, we see the following output:
Account    User       RawShares  NormShares    RawUsage  NormUsage  EffectvUsage  FairShare  LevelFS  GrpTRESMins  TRESRunMins
---------- ---------- ---------- ----------- ----------- ---------- ------------- ---------- -------- ------------ ------------------------------
root                          1    0.000000   268724597   0.000000      0.000000                                   cpu=1098201,mem=5856709132,en+
 root      root               1    0.100000           0   0.000000      0.000000   0.000000                        cpu=0,mem=0,energy=0,node=0,b+
 group1                       1    0.100000           0   0.000000      0.000000                                   cpu=0,mem=0,energy=0,node=0,b+
 group2                       1    0.100000           0   0.000000      0.000000                                   cpu=0,mem=0,energy=0,node=0,b+
 group3                       1    0.100000   268724597   0.000000      0.000000                                   cpu=1098201,mem=5856709132,en+
 group4                       1    0.100000           0   0.000000      0.000000                                   cpu=0,mem=0,energy=0,node=0,b+
 group5                       1    0.100000           0   0.000000      0.000000                                   cpu=0,mem=0,energy=0,node=0,b+
 group6                       1    0.100000           0   0.000000      0.000000                                   cpu=0,mem=0,energy=0,node=0,b+
 group7                       1    0.100000           0   0.000000      0.000000                                   cpu=0,mem=0,energy=0,node=0,b+
 group8                       1    0.100000           0   0.000000      0.000000                                   cpu=0,mem=0,energy=0,node=0,b+
 group9                       1    0.100000           0   0.000000      0.000000                                   cpu=0,mem=0,energy=0,node=0,b+
We find it really weird that the FairShare value is 0 for the root account and "NULL" for all the other groups, even the one with the greatest raw usage.
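As a sanity check (assuming we read the sshare man page correctly, NormUsage should be the association's RawUsage divided by the cluster-wide RawUsage):

# group3 is the only account with usage, so its NormUsage should be ~1.0,
# not the 0.000000 that sshare reports above.
root_raw_usage = 268_724_597
group3_raw_usage = 268_724_597
print(group3_raw_usage / root_raw_usage)  # 1.0 expected

So it looks to us as if the usage normalization itself is coming out as zero everywhere, not just the final FairShare column.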
Taking a look at the data for our users, we see the following:
Account    User       RawShares  NormShares    RawUsage  EffectvUsage  FairShare
---------- ---------- ---------- ----------- ----------- ------------- ----------
root                          1    0.000000   268983721      0.000000
 root      root               1    0.100000           0      0.000000   0.000000
 group3                       1    0.100000   268983721      0.000000
  group3   user1              1    0.090909    12109374      0.000000   0.000000
  group3   user2              1    0.090909           0      0.000000   0.000000
  group3   user3              1    0.090909           0      0.000000   0.000000
  group3   user4              1    0.090909           0      0.000000   0.000000
  group3   user5              1    0.090909           0      0.000000   0.000000
  group3   user6              1    0.090909           0      0.000000   0.000000
  group3   user7              1    0.090909           0      0.000000   0.000000
  group3   user8              1    0.090909   208824597      0.000000   0.000000
  group3   user9              1    0.090909           0      0.000000   0.000000
  group3   user10             1    0.090909           0      0.000000   0.000000
  group3   user11             1    0.090909    48049750      0.000000   0.000000
 group4                       1    0.100000           0      0.000000
  group4   user13             1    0.000000      499452      0.000000   0.000000
 group5                       1    0.100000           0      0.000000
  group5   user14             1    0.000000     1539603      0.000000   0.000000
This behavior seems odd to us, since user1, user8, user11, user13 and user14 are the ones with the highest RawUsage, yet the FairShare value is the same for all of them, including users who have not yet submitted any job.
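For reference, here is our (possibly wrong) mental model of what the default Fair Tree algorithm should produce, sketched with numbers from the output above:

# Fair Tree ranks siblings by LevelFS = NormShares / EffectvUsage (as we
# understand the fair_tree docs); an association with no usage should sort
# ahead of one with heavy usage, so identical FairShare values for, say,
# user2 (no jobs) and user8 (most usage) look wrong to us.
def level_fs(norm_shares: float, effective_usage: float) -> float:
    # Unused associations are treated as "infinitely under-served" here;
    # this special case is our assumption about how zero usage resolves.
    return float("inf") if effective_usage == 0 else norm_shares / effective_usage

print(level_fs(0.090909, 0.0))   # user2: no usage
print(level_fs(0.090909, 0.78))  # user8: ~78% of group3's usage (approx.)

With EffectvUsage reported as 0.000000 for every association, every LevelFS comparison degenerates, which would at least be consistent with the identical FairShare values we see.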
We also noticed that the following error messages appear with some regularity in the slurmctld log:
[2024-03-07T16:38:13.260] error: _append_list_to_array: unable to append NULL list to assoc list.
[2024-03-07T16:38:13.260] error: _calc_tree_fs: unable to calculate fairshare on empty tree
The errors above look like they are coming from: https://github.com/SchedMD/slurm/blob/b11bf689b270f1f5dfe4b0cd54c4fa84b4af31...
Are we missing some setting in slurm.conf? This is kind of strange, because we have another cluster with pretty much the same configuration, and there FairShare is calculated without any problems. Any help would be appreciated.