[slurm-users] Fairshare config change effect on running/queued jobs?
Walsh, Kevin
walsh at njit.edu
Fri Apr 30 16:39:37 UTC 2021
Hello everyone,
We would like to deploy a "fair share" scheduling configuration and are
wondering whether we should expect any effects on jobs that are already
running or already queued at the time the config is changed.
The proposed changes are taken from the example at
https://slurm.schedmd.com/archive/slurm-18.08.9/priority_multifactor.html#config:
> # Activate the Multi-factor Job Priority Plugin with decay
> PriorityType=priority/multifactor
> # 2 week half-life
> PriorityDecayHalfLife=14-0
> # The larger the job, the greater its job size priority.
> PriorityFavorSmall=NO
> # The job's age factor reaches 1.0 after waiting in the
> # queue for 2 weeks.
> PriorityMaxAge=14-0
> # This next group determines the weighting of each of the
> # components of the Multi-factor Job Priority Plugin.
> # The default value for each of the following is 1.
> PriorityWeightAge=1000
> PriorityWeightFairshare=10000
> PriorityWeightJobSize=1000
> PriorityWeightPartition=1000
> PriorityWeightQOS=0 # don't use the qos factor
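For completeness, our planned rollout is to push the updated slurm.conf to the
controller and nodes and then reconfigure (or restart slurmctld if switching
the priority plugin requires it), after which we would sanity-check the
weights and per-job factors with something like the commands below. This is
only a sketch of what we intend to run, not a tested procedure:

  # after distributing the new slurm.conf:
  scontrol reconfigure
  # show the configured weight of each priority component
  sprio --weights
  # show the per-job priority factors for currently queued jobs
  sprio -l
  # show usage and the resulting fairshare factor per association
  sshare -l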
We're running Slurm 18.08.8 on CentOS Linux 7.8.2003. The current
slurm.conf uses the defaults as far as fair share is concerned:
> EnforcePartLimits=ALL
> GresTypes=gpu
> MpiDefault=pmix
> ProctrackType=proctrack/cgroup
> PrologFlags=x11,contain
> PropagateResourceLimitsExcept=MEMLOCK,STACK
> RebootProgram=/sbin/reboot
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=slurm
> SlurmdSyslogDebug=verbose
> StateSaveLocation=/var/spool/slurm/ctld
> SwitchType=switch/none
> TaskPlugin=task/cgroup,task/affinity
> TaskPluginParam=Sched
> HealthCheckInterval=300
> HealthCheckProgram=/usr/sbin/nhc
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> DefMemPerCPU=1024
> FastSchedule=1
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> AccountingStorageHost=sched-db.lan
> AccountingStorageLoc=slurm_acct_db
> AccountingStoragePass=/var/run/munge/munge.socket.2
> AccountingStoragePort=6819
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageUser=slurm
> AccountingStoreJobComment=YES
> AccountingStorageTRES=gres/gpu
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
> SlurmctldDebug=info
> SlurmdDebug=info
> SlurmSchedLogFile=/var/log/slurm/slurmsched.log
> SlurmSchedLogLevel=1
Node and partition configs are omitted above.
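In case it helps, this is roughly how we have been confirming that the
priority settings are still at their defaults; it assumes nothing beyond the
config shown above:

  # dump only the priority-related settings from the running controller
  scontrol show config | grep -i '^Priority'
  # confirm accounting associations exist, since fairshare is computed per association
  sacctmgr show assoc format=cluster,account,user,fairshare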
Any and all advice will be greatly appreciated.
Best wishes,
~Kevin
Kevin Walsh
Senior Systems Administration Specialist
New Jersey Institute of Technology
Academic & Research Computing Systems