Hi guys,
We've just set up our new cluster and are facing some issues
regarding fairshare calculation.
Our Slurm directives regarding priority calculation are defined as
follows:
PriorityType=priority/multifactor
PriorityFlags=MAX_TRES
PriorityDecayHalfLife=14-0
PriorityFavorSmall=NO
PriorityMaxAge=14-0
PriorityWeightAge=1000
PriorityWeightJobSize=1000
PriorityWeightPartition=10000000
PriorityWeightQOS=10000000
PriorityWeightTRES=CPU=2000,Mem=4000
PriorityWeightFairshare=100000
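To show why the fairshare value matters to us, here is a rough sketch of how we understand these weights to combine: a weighted sum of factors that each lie between 0.0 and 1.0, as described in the multifactor plugin documentation. This is not Slurm's actual code. The weights are copied from the directives above, the factor values are made-up placeholders, and the TRES, site and nice terms are left out for brevity.

```python
# Sketch only: weighted sum as described in the multifactor plugin docs.
# Weights come from our slurm.conf; the factor values (0.0 to 1.0) are
# hypothetical placeholders, not measurements from our cluster.
WEIGHTS = {
    "age": 1000,
    "jobsize": 1000,
    "partition": 10_000_000,
    "qos": 10_000_000,
    "fairshare": 100_000,
}


def job_priority(factors, weights=WEIGHTS):
    """Sum weight * factor per component; missing factors count as 0."""
    return sum(weights[name] * factors.get(name, 0.0) for name in weights)


# Two otherwise identical jobs: one with a non-zero fairshare factor,
# one with the 0.000000 that sshare currently reports for everyone.
base = {"age": 0.5, "jobsize": 0.1, "partition": 1.0, "qos": 0.0}
with_fs = job_priority({**base, "fairshare": 0.5})
without_fs = job_priority({**base, "fairshare": 0.0})
print(with_fs - without_fs)
```

With a fairshare factor stuck at 0, the PriorityWeightFairshare term contributes nothing, so past usage would not separate jobs at all.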
The partition we are submitting our jobs to is set up as follows:
PartitionName=mypart Priority=1000 TRESBillingWeights="CPU=1.0,Mem=0.25G" Default=YES MaxTime=96:0:0 DefMemPerCPU=5333 Nodes=node[001-036] MaxNodes=20
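For context, this is a minimal sketch of how we read the combination of PriorityFlags=MAX_TRES and the TRESBillingWeights above for a single-node job: the billable value is the largest of the individual weighted TRES rather than their sum. The helper name and the assumption that "0.25G" means 0.25 per GB (1 GB = 1024 MB) are ours, so please correct us if that reading is wrong.

```python
# Sketch only: our reading of MAX_TRES billing for a single-node job.
# billing_max_tres is a hypothetical helper, not a Slurm function.
MB_PER_GB = 1024


def billing_max_tres(cpus, mem_mb, cpu_weight=1.0, mem_weight_per_gb=0.25):
    """Return the larger of the weighted CPU and weighted memory values."""
    cpu_bill = cpus * cpu_weight
    mem_bill = (mem_mb / MB_PER_GB) * mem_weight_per_gb
    return max(cpu_bill, mem_bill)


# A 4-CPU job that takes the partition default of DefMemPerCPU=5333 MB.
print(billing_max_tres(4, 4 * 5333))
```

Under these assumptions a job using the default memory per CPU is billed on its memory (about 5.2) rather than on its 4 CPUs.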
Whenever we take a look at the fairshare value using sshare -l we
see the following output:
Account User RawShares NormShares RawUsage NormUsage EffectvUsage FairShare LevelFS GrpTRESMins TRESRunMins
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ---------- ------------------------------ ------------------------------
root 1 0.000000 268724597 0.000000 0.000000 cpu=1098201,mem=5856709132,en+
root root 1 0.100000 0 0.000000 0.000000 0.000000 cpu=0,mem=0,energy=0,node=0,b+
group1 1 0.100000 0 0.000000 0.000000 cpu=0,mem=0,energy=0,node=0,b+
group2 1 0.100000 0 0.000000 0.000000 cpu=0,mem=0,energy=0,node=0,b+
group3 1 0.100000 268724597 0.000000 0.000000 cpu=1098201,mem=5856709132,en+
group4 1 0.100000 0 0.000000 0.000000 cpu=0,mem=0,energy=0,node=0,b+
group5 1 0.100000 0 0.000000 0.000000 cpu=0,mem=0,energy=0,node=0,b+
group6 1 0.100000 0 0.000000 0.000000 cpu=0,mem=0,energy=0,node=0,b+
group7 1 0.100000 0 0.000000 0.000000 cpu=0,mem=0,energy=0,node=0,b+
group8 1 0.100000 0 0.000000 0.000000 cpu=0,mem=0,energy=0,node=0,b+
group9 1 0.100000 0 0.000000 0.000000 cpu=0,mem=0,energy=0,node=0,b+
We think it is really weird that the FairShare value is 0 for the
root account and "NULL" for all other groups, even the one that had
the greatest raw usage.
When we look at the data for our users, we see the
following:
Account User RawShares NormShares RawUsage EffectvUsage FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root 1 0.000000 268983721 0.000000
root root 1 0.100000 0 0.000000 0.000000
group3 1 0.100000 268983721 0.000000
group3 user1 1 0.090909 12109374 0.000000 0.000000
group3 user2 1 0.090909 0 0.000000 0.000000
group3 user3 1 0.090909 0 0.000000 0.000000
group3 user4 1 0.090909 0 0.000000 0.000000
group3 user5 1 0.090909 0 0.000000 0.000000
group3 user6 1 0.090909 0 0.000000 0.000000
group3 user7 1 0.090909 0 0.000000 0.000000
group3 user8 1 0.090909 208824597 0.000000 0.000000
group3 user9 1 0.090909 0 0.000000 0.000000
group3 user10 1 0.090909 0 0.000000 0.000000
group3 user11 1 0.090909 48049750 0.000000 0.000000
group4 1 0.100000 0 0.000000
group4 user13 1 0.000000 499452 0.000000 0.000000
group5 1 0.100000 0 0.000000
group5 user14 1 0.000000 1539603 0.000000 0.000000
This is weird behavior, since user1, user8, user11, user13 and
user14 are the ones who have the most RawUsage, yet the FairShare
value is the same for all of them, including the users that have not
yet submitted any job.
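As a sanity check on the numbers above, this small sketch divides each group3 user's RawUsage by the group3 total. It is plain division on the figures from the sshare output, not the actual Fair Tree calculation, but it shows the kind of non-zero usage fraction we expected to see instead of 0.000000.

```python
# RawUsage values copied from the sshare output above (group3 only).
raw_usage = {"user1": 12109374, "user8": 208824597, "user11": 48049750}
group3_total = 268983721

# The three users account for all of group3's RawUsage.
assert sum(raw_usage.values()) == group3_total

# Fraction of the group's usage attributable to each user.
usage_share = {user: usage / group3_total for user, usage in raw_usage.items()}
for user, share in usage_share.items():
    print(f"{user}: {share:.6f}")
```

The fractions come out at roughly 0.045, 0.776 and 0.179, so the raw data is clearly there; it just does not seem to reach the EffectvUsage and FairShare columns.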
We also noticed that the slurmctld log contains the following
error messages, which appear with some regularity:
[2024-03-07T16:38:13.260] error: _append_list_to_array: unable to append NULL list to assoc list.
[2024-03-07T16:38:13.260] error: _calc_tree_fs: unable to calculate fairshare on empty tree
The errors above look like they are coming from:
https://github.com/SchedMD/slurm/blob/b11bf689b270f1f5dfe4b0cd54c4fa84b4af315b/src/plugins/priority/multifactor/fair_tree.c#L337
Are we missing a setting in slurm.conf? This is rather
strange, because we have another cluster with pretty much the same
configuration, and there the FairShare is calculated without any
problems.
Any help would be appreciated.
--
Cumprimentos / Best Regards,
Zacarias Benta
LIP/INCD @ UMINHO
----------------------------------------------
/ Use linux, and may the source be with you. /
----------------------------------------------
\ __
-=(o '.
'.-.\
/| \\
'| ||
_\_):,_