[slurm-users] priority/multifactor, sshare, and AccountingStorageEnforce

Thu Jul 9 19:16:28 UTC 2020

Try setting RawShares to something greater than 1.  I've seen it be the
case that when you set it to 1, it creates weirdness like this.
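
As a rough sketch of what that change could look like: shares are set per
association with sacctmgr's fairshare option. The snippet below only prints
the commands so they can be reviewed before anything is changed. The account
names covid and group1 are taken from the sshare output quoted below; the
value 100 and the helper name shares_cmd are arbitrary choices for
illustration and do not come from the original thread.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the sacctmgr command that would
# set RawShares for one account. -i makes sacctmgr commit without prompting.
shares_cmd() {
    printf 'sacctmgr -i modify account where name=%s set fairshare=%s\n' "$1" "$2"
}

# Assumed value: 100 is only an example of "something greater than 1".
NEW_SHARES=100

# Account names as they appear in the quoted sshare output.
for acct in covid group1; do
    shares_cmd "$acct" "$NEW_SHARES"
done
```

If the printed commands look right, they can be piped to sh, and sshare -l
can be run again afterwards to check whether the RawShares column shows the
new values.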

-Paul Edmon-

On 7/9/2020 1:12 PM, Dumont, Joey wrote:
>
> Hi,
>
>
> We recently set up Fair Tree scheduling (we are running Slurm 19.05) and
> are trying to use sshare to see usage information. Unfortunately,
> sshare reports all zeros, even though there seems to be data in the
> backend DB. Here's an example of the output:
>
>
> $ sshare -l
>            Account       User  RawShares  NormShares RawUsage   NormUsage  EffectvUsage  FairShare    LevelFS               GrpTRESMins                    TRESRunMins
> -------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ---------- ------------------------------ ------------------------------
> root                                                            0             0.000000              0.000000               cpu=0,mem=0,energy=0,node=0,b+
>  covid                                  1                       0             0.000000              0.000000               cpu=0,mem=0,energy=0,node=0,b+
> covid-01                               1  0                  0.000000              0.000000                     cpu=0,mem=0,energy=0,node=0,b+
> covid-02                               1  0                  0.000000              0.000000                     cpu=0,mem=0,energy=0,node=0,b+
>  group1                                 1                       0             0.000000              0.000000               cpu=0,mem=0,energy=0,node=0,b+
> subgroup1                              1  0                  0.000000              0.000000                     cpu=0,mem=0,energy=0,node=0,b+
>  othersubgroups                        1  0                  0.000000              0.000000                     cpu=0,mem=0,energy=0,node=0,b+
>  othersubgroups                        1  0                  0.000000              0.000000                     cpu=0,mem=0,energy=0,node=0,b+
>  othersubgroups                        1  0                  0.000000              0.000000                     cpu=0,mem=0,energy=0,node=0,b+
>  othersubgroups                        1  0                  0.000000              0.000000                     cpu=0,mem=0,energy=0,node=0,b+
>  othersubgroups                        1  0                  0.000000              0.000000                     cpu=0,mem=0,energy=0,node=0,b+
>  othersubgroups                        1  0                  0.000000              0.000000                     cpu=0,mem=0,energy=0,node=0,b+
>  othersubgroups                        1  0                  0.000000              0.000000                     cpu=0,mem=0,energy=0,node=0,b+
>  othersubgroups                        1  0                  0.000000              0.000000                     cpu=0,mem=0,energy=0,node=0,b+
>  othersubgroups                        1  0                  0.000000              0.000000                     cpu=0,mem=0,energy=0,node=0,b+
>  othersubgroups                        1  0                  0.000000              0.000000                     cpu=0,mem=0,energy=0,node=0,b+
> subgroups                              1  0                  0.000000              0.000000                     cpu=0,mem=0,energy=0,node=0,b+
> subgroups                              4  0                  0.000000              0.000000                     cpu=0,mem=0,energy=0,node=0,b+
> subgroups                              1  0                  0.000000              0.000000                     cpu=0,mem=0,energy=0,node=0,b+
>  SUBGROUP                               1                       0             0.000000              0.000000               cpu=0,mem=0,energy=0,node=0,b+
>  SUBGROUP                               1                       0             0.000000              0.000000               cpu=0,mem=0,energy=0,node=0,b+
>
>
>
> And the slurm.conf config:
>
>
> ClusterName=trixie
> SlurmctldHost=trixie(10.10.0.11)
> SlurmctldHost=hn2(10.10.0.12)
> GresTypes=gpu
> SlurmUser=slurm
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/gpfs/share/slurm/
> SlurmdSpoolDir=/var/spool/slurm/d
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/cgroup
> ReturnToService=2
> PrologFlags=x11
> TaskPlugin=task/cgroup
>
> # TIMERS
> SlurmctldTimeout=60
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> #
>
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> FastSchedule=1
>
> SchedulerParameters=bf_interval=60,bf_continue,bf_resolution=600,bf_window=2880,bf_max_job_test=5000,bf_max_job_part=1000,bf_max_job_user=10,bf_max_job_start=100
>
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=14-0
> PriorityWeightFairshare=100000
> PriorityWeightAge=1000
> PriorityWeightPartition=10000
> PriorityWeightJobSize=1000
> PriorityMaxAge=1-0
>
> # LOGGING
> SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdDebug=3
> SlurmdLogFile=/var/log/slurmd.log
> JobCompType=jobcomp/none
>
> # ACCOUNTING
> JobAcctGatherType=jobacct_gather/linux
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=hn2
> AccountingStorageTRES=gres/gpu
>
> # COMPUTE NODES
> NodeName=cn[101-136] Procs=32 Gres=gpu:4 RealMemory=192782
>
> # Partitions
> PartitionName=JobTesting Nodes=cn[135-136] MaxTime=02:00:00 DefaultTime=00:30:00 MaxMemPerNode=192782 AllowGroups=DT-AI4DCluster-All State=UP
> PartitionName=TrixieMain Nodes=cn[106-134] MaxTime=48:00:00 DefaultTime=08:00:00 MaxMemPerNode=192782 AllowGroups=DT-AI4DCluster-All State=UP Default=YES
> PartitionName=ItOpsTests Nodes=cn[102-105] MaxTime=INFINITE MaxMemPerNode=192782 AllowGroups=Admin-Access,Manager-Access State=UP
> PartitionName=ItOpsImage Nodes=cn101 MaxTime=INFINITE MaxMemPerNode=192782 AllowGroups=Admin-Access State=UP
>
> Is there anything that would explain why sshare returns only zeros?
>
> The only peculiarity I can think of is that I don't think we
> restarted slurmctld; we only reconfigured it.
>
>
> Cheers,
>
>
> Joey Dumont
>
> Technical Advisor, Knowledge, Information, and Technology Services
> National Research Council Canada / Government of Canada
> joey.dumont at nrc-cnrc.gc.ca / Tel: 613-990-8152 / Cell: 438-340-7436