[slurm-users] priority/multifactor, sshare, and AccountingStorageEnforce

Dumont, Joey Joey.Dumont at nrc-cnrc.gc.ca
Thu Jul 9 17:12:46 UTC 2020


Hi,


We recently set up fair-tree scheduling (we are running Slurm 19.05) and are trying to use sshare to view usage information. Unfortunately, sshare reports all zeros, even though there appears to be data in the backend database. Here's an example of the output:


$ sshare -l
             Account       User  RawShares  NormShares    RawUsage   NormUsage  EffectvUsage  FairShare    LevelFS                    GrpTRESMins                    TRESRunMins
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ---------- ------------------------------ ------------------------------
root                                                             0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
 covid                                   1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
  covid-01                               1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
  covid-02                               1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
 group1                                  1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
  subgroup1                              1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
   othersubgroups                        1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
   othersubgroups                        1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
   othersubgroups                        1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
   othersubgroups                        1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
   othersubgroups                        1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
   othersubgroups                        1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
   othersubgroups                        1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
   othersubgroups                        1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
   othersubgroups                        1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
   othersubgroups                        1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
  subgroups                              1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
  subgroups                              4                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
  subgroups                              1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
 SUBGROUP                                1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
 SUBGROUP                                1                       0                  0.000000              0.000000                                cpu=0,mem=0,energy=0,node=0,b+
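
As a cross-check that usage data is actually reaching slurmdbd, we have also been querying the accounting database through the standard reporting tools; something like the following (the start date is just an example) should list per-account utilization if the records are there:

$ sreport cluster AccountUtilizationByUser cluster=trixie start=2020-07-01 -t Hours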



And here is the slurm.conf configuration:


ClusterName=trixie
SlurmctldHost=trixie(10.10.0.11)
SlurmctldHost=hn2(10.10.0.12)
GresTypes=gpu
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/gpfs/share/slurm/
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/cgroup
ReturnToService=2
PrologFlags=x11
TaskPlugin=task/cgroup

# TIMERS
SlurmctldTimeout=60
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#

# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
FastSchedule=1

SchedulerParameters=bf_interval=60,bf_continue,bf_resolution=600,bf_window=2880,bf_max_job_test=5000,bf_max_job_part=1000,bf_max_job_user=10,bf_max_job_start=100

PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightPartition=10000
PriorityWeightJobSize=1000
PriorityMaxAge=1-0

# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none

# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=hn2
AccountingStorageTRES=gres/gpu

# COMPUTE NODES
NodeName=cn[101-136] Procs=32 Gres=gpu:4 RealMemory=192782

# Partitions
PartitionName=JobTesting Nodes=cn[135-136] MaxTime=02:00:00 DefaultTime=00:30:00 MaxMemPerNode=192782 AllowGroups=DT-AI4DCluster-All State=UP
PartitionName=TrixieMain Nodes=cn[106-134] MaxTime=48:00:00 DefaultTime=08:00:00 MaxMemPerNode=192782 AllowGroups=DT-AI4DCluster-All State=UP Default=YES
PartitionName=ItOpsTests Nodes=cn[102-105] MaxTime=INFINITE MaxMemPerNode=192782 AllowGroups=Admin-Access,Manager-Access State=UP
PartitionName=ItOpsImage Nodes=cn101 MaxTime=INFINITE MaxMemPerNode=192782 AllowGroups=Admin-Access State=UP

Is there anything that would explain why sshare returns only zeros?
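
(As the subject line suggests, we also noticed that we don't set AccountingStorageEnforce at all in the config above. If it matters, the line we would be adding to slurm.conf is something like:

AccountingStorageEnforce=associations

but we haven't tried that yet.)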


The only peculiarity I can think of is that I don't believe we restarted slurmctld after making these changes; we only reconfigured.
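
For reference, by "reconfigured" I mean we ran:

$ scontrol reconfigure

on the controller, rather than doing a full daemon restart (e.g. systemctl restart slurmctld).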


Cheers,


Joey Dumont

Technical Advisor, Knowledge, Information, and Technology Services
National Research Council Canada / Government of Canada
joey.dumont at nrc-cnrc.gc.ca / Tel: 613-990-8152 / Cell: 438-340-7436