[slurm-users] priority/multifactor, sshare, and AccountingStorageEnforce
Paul Edmon
pedmon at cfa.harvard.edu
Thu Jul 9 19:16:28 UTC 2020
Try setting RawShares to something greater than 1. I've seen it be the
case that when you set it to 1 it creates weirdness like this.
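For example, something along these lines (using your covid account purely as
an illustration; adjust the account name and the value to suit your tree),
and then re-check sshare:

    sacctmgr modify account where name=covid set fairshare=100
    sshare -l -A covid
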
-Paul Edmon-
On 7/9/2020 1:12 PM, Dumont, Joey wrote:
>
> Hi,
>
>
> We recently set up fair tree scheduling (we have 19.05 running), and
> are trying to use sshare to see usage information. Unfortunately,
> sshare reports all zeros, even though there seems to be data in the
> backend DB. Here's an example output:
>
>
> $ sshare -l
> Account               User   RawShares   NormShares   RawUsage   NormUsage   EffectvUsage   FairShare   LevelFS   GrpTRESMins   TRESRunMins
> -------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ---------- ------------------------------ ------------------------------
> root                                  0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> covid                         1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> covid-01                      1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> covid-02                      1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> group1                        1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> subgroup1                     1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> othersubgroups                1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> othersubgroups                1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> othersubgroups                1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> othersubgroups                1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> othersubgroups                1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> othersubgroups                1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> othersubgroups                1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> othersubgroups                1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> othersubgroups                1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> othersubgroups                1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> subgroups                     1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> subgroups                     4       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> subgroups                     1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> SUBGROUP                      1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
> SUBGROUP                      1       0   0.000000   0.000000   cpu=0,mem=0,energy=0,node=0,b+
>
>
>
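> Regarding the point above that there does seem to be data in the backend DB:
> a check along the following lines is one way to see whether the backend has
> any usage recorded (the start date is just an example):
>
> $ sreport cluster AccountUtilizationByUser start=2020-07-01 -t Hours
>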
> And the slurm.conf config:
>
>
> ClusterName=trixie
> SlurmctldHost=trixie(10.10.0.11)
> SlurmctldHost=hn2(10.10.0.12)
> GresTypes=gpu
> SlurmUser=slurm
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/gpfs/share/slurm/
> SlurmdSpoolDir=/var/spool/slurm/d
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/cgroup
> ReturnToService=2
> PrologFlags=x11
> TaskPlugin=task/cgroup
>
> # TIMERS
> SlurmctldTimeout=60
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> #
>
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> FastSchedule=1
>
> SchedulerParameters=bf_interval=60,bf_continue,bf_resolution=600,bf_window=2880,bf_max_job_test=5000,bf_max_job_part=1000,bf_max_job_user=10,bf_max_job_start=100
>
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=14-0
> PriorityWeightFairshare=100000
> PriorityWeightAge=1000
> PriorityWeightPartition=10000
> PriorityWeightJobSize=1000
> PriorityMaxAge=1-0
>
> # LOGGING
> SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdDebug=3
> SlurmdLogFile=/var/log/slurmd.log
> JobCompType=jobcomp/none
>
> # ACCOUNTING
> JobAcctGatherType=jobacct_gather/linux
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=hn2
> AccountingStorageTRES=gres/gpu
>
> # COMPUTE NODES
> NodeName=cn[101-136] Procs=32 Gres=gpu:4 RealMemory=192782
>
> # Partitions
> PartitionName=JobTesting Nodes=cn[135-136] MaxTime=02:00:00
> DefaultTime=00:30:00 MaxMemPerNode=192782
> AllowGroups=DT-AI4DCluster-All State=UP
> PartitionName=TrixieMain Nodes=cn[106-134] MaxTime=48:00:00
> DefaultTime=08:00:00 MaxMemPerNode=192782
> AllowGroups=DT-AI4DCluster-All State=UP Default=YES
> PartitionName=ItOpsTests Nodes=cn[102-105] MaxTime=INFINITE
> MaxMemPerNode=192782 AllowGroups=Admin-Access,Manager-Access State=UP
> PartitionName=ItOpsImage Nodes=cn101 MaxTime=INFINITE
> MaxMemPerNode=192782 AllowGroups=Admin-Access State=UP
>
> Is there anything that would explain why sshare returns only zeros?
>
> The only peculiarity I can think of is that I don't think we restarted
> slurmctld; we just reconfigured it.
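> To be precise, by "reconfigured" I mean something like:
>
> $ scontrol reconfigure      # re-reads slurm.conf on the running daemons
>
> rather than a full restart of the controller (e.g. systemctl restart slurmctld).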
>
>
> Cheers,
>
>
> Joey Dumont
>
> Technical Advisor, Knowledge, Information, and Technology Services
> National Research Council Canada / Government of Canada
> joey.dumont at nrc-cnrc.gc.ca / Tel: 613-990-8152 / Cell: 438-340-7436
>