[slurm-users] Slurm Fairshare / Multifactor Priority

Paul Edmon pedmon at cfa.harvard.edu
Wed May 29 15:16:05 UTC 2019


I believe it is still the case, but I haven't tested it.  I put this in 
way back when partition_job_depth was first introduced (which was eons 
ago now).  We run about 100 or so partitions, so this has served us well 
as a general rule.  What happens is that if you set partition_job_depth 
too deep, the main loop may not get through all the partitions before it 
has to give up and start again.  This led to partition starvation in the 
past: there were jobs waiting to be scheduled in a partition that had 
space, but they never started because the main loop never got to them, 
and the backfill loop took too long to clean up, so those jobs took 
forever to schedule.
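A toy model (my own sketch, not Slurm code) illustrates the starvation: the main loop walks the queue in priority order, testing at most `default_queue_depth` jobs in total and at most `partition_job_depth` per partition, so a partition whose jobs all sit past the cutoff is never examined that cycle:

```python
# Toy model of the main scheduling loop's depth limits (not Slurm source).
# queue: list of partition ids for pending jobs, in priority order.
def partitions_examined(queue, queue_depth, part_depth):
    tested = 0
    seen_per_part = {}
    examined = set()
    for part in queue:
        if tested >= queue_depth:      # total-depth budget exhausted
            break
        if seen_per_part.get(part, 0) >= part_depth:
            continue                   # per-partition limit hit; skip job
        seen_per_part[part] = seen_per_part.get(part, 0) + 1
        tested += 1
        examined.add(part)
    return examined

# 5 partitions, 10 queued jobs each, queued partition-by-partition.
queue = [p for p in range(5) for _ in range(10)]
# Deep per-partition limit: the budget is spent on the first two partitions.
print(partitions_examined(queue, queue_depth=20, part_depth=10))  # {0, 1}
# A shallow partition_job_depth spreads the same budget across all five.
print(partitions_examined(queue, queue_depth=20, part_depth=4))   # {0, 1, 2, 3, 4}
```

This is why a small partition_job_depth helps when you have many partitions: it keeps any one partition from eating the whole queue-depth budget.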

With the various improvements to the scheduler this may no longer be the 
case, but I haven't taken the time to test it on our cluster as our 
current setup has worked well.

-Paul Edmon-

On 5/29/19 11:04 AM, Kilian Cavalotti wrote:
> Hi Paul,
>
> I'm wondering about this part in your SchedulerParameters:
>
> ### default_queue_depth should be some multiple of the partition_job_depth,
> ### ideally number_of_partitions * partition_job_depth, but typically the
> ### main loop exits prematurely if you go over about 400. A
> ### partition_job_depth of 10 seems to work well.
>
> Do you remember if that's still the case, or if it's in relation with 
> a reported issue? That sure sounds like something that would need to 
> be fixed if it hasn't been already.
>
> Cheers,
> -- 
> Kilian
>
> On Wed, May 29, 2019 at 7:42 AM Paul Edmon <pedmon at cfa.harvard.edu> wrote:
>
>     For reference we are running 18.08.7
>
>     -Paul Edmon-
>
>     On 5/29/19 10:39 AM, Paul Edmon wrote:
>>
>>     Sure.  Here is what we have:
>>
>>     ########################## Scheduling
>>     #####################################
>>     ### This section is specific to scheduling
>>
>>     ### Tells the scheduler to enforce limits for all partitions
>>     ### that a job submits to.
>>     EnforcePartLimits=ALL
>>
>>     ### Lets slurm know that we have a jobsubmit.lua script
>>     JobSubmitPlugins=lua
>>
>>     ### When a job is launched this has slurmctld send the user information
>>     ### instead of having AD do the lookup on the node itself.
>>     LaunchParameters=send_gids
>>
>>     ### Maximum sizes for Jobs.
>>     MaxJobCount=200000
>>     MaxArraySize=10000
>>     DefMemPerCPU=100
>>
>>     ### Job Timers
>>     CompleteWait=0
>>
>>     ### We set the EpilogMsgTime long so that Epilog Messages don't pile up
>>     ### all at one time due to forced exit, which can cause problems for
>>     ### the master.
>>     EpilogMsgTime=3000000
>>     InactiveLimit=0
>>     KillWait=30
>>
>>     ### This only applies to the reservation time limit; the job must
>>     ### still obey the partition time limit.
>>     ResvOverRun=UNLIMITED
>>     MinJobAge=600
>>     Waittime=0
>>
>>     ### Scheduling parameters
>>     ### FastSchedule 2 lets slurm know not to auto detect the node config
>>     ### but rather follow our definition.  We also use setting 2 because,
>>     ### due to our geographic size, nodes may drop out of slurm and then
>>     ### reconnect.  If we had 1 they would be set to drain when they
>>     ### reconnect.  Setting it to 2 allows them to rejoin without issue.
>>     FastSchedule=2
>>     SchedulerType=sched/backfill
>>     SelectType=select/cons_res
>>     SelectTypeParameters=CR_Core_Memory
>>
>>     ### Governs default preemption behavior
>>     PreemptType=preempt/partition_prio
>>     PreemptMode=REQUEUE
>>
>>     ### default_queue_depth should be some multiple of the partition_job_depth,
>>     ### ideally number_of_partitions * partition_job_depth, but typically the
>>     ### main loop exits prematurely if you go over about 400. A
>>     ### partition_job_depth of 10 seems to work well.
>>     SchedulerParameters=\
>>     default_queue_depth=1150,\
>>     partition_job_depth=10,\
>>     max_sched_time=50,\
>>     bf_continue,\
>>     bf_interval=30,\
>>     bf_resolution=600,\
>>     bf_window=11520,\
>>     bf_max_job_part=0,\
>>     bf_max_job_user=10,\
>>     bf_max_job_test=10000,\
>>     bf_max_job_start=1000,\
>>     bf_ignore_newly_avail_nodes,\
>>     kill_invalid_depend,\
>>     pack_serial_at_end,\
>>     nohold_on_prolog_fail,\
>>     preempt_strict_order,\
>>     preempt_youngest_first,\
>>     max_rpc_cnt=8
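As a back-of-the-envelope check of the sizing comment above (the partition count here is hypothetical, chosen to match the configured value):

```python
# Sizing rule from the comment: default_queue_depth should ideally be
# number_of_partitions * partition_job_depth.
# ~115 partitions is a made-up figure that matches the config above.
number_of_partitions = 115
partition_job_depth = 10
default_queue_depth = number_of_partitions * partition_job_depth
print(default_queue_depth)  # 1150
```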
>>
>>     ################################ Fairshare
>>     ################################
>>     ### This section sets the fairshare calculations
>>
>>     PriorityType=priority/multifactor
>>
>>     ### Settings for fairshare calculation frequency and shape.
>>     FairShareDampeningFactor=1
>>     PriorityDecayHalfLife=28-0
>>     PriorityCalcPeriod=1
>>
>>     ### Settings for fairshare weighting.
>>     PriorityMaxAge=7-0
>>     PriorityWeightAge=10000000
>>     PriorityWeightFairshare=20000000
>>     PriorityWeightJobSize=0
>>     PriorityWeightPartition=0
>>     PriorityWeightQOS=1000000000
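To illustrate how these weights combine, here is a sketch of the multifactor sum with made-up factor values (the real plugin normalizes each factor into [0, 1] before weighting; the factor values below are purely illustrative):

```python
# Sketch of the priority/multifactor weighted sum, using the weights from
# the config above.  Factor values are invented, not from a real job.
weights = {
    "age":       10000000,    # PriorityWeightAge
    "fairshare": 20000000,    # PriorityWeightFairshare
    "jobsize":   0,           # PriorityWeightJobSize
    "partition": 0,           # PriorityWeightPartition
    "qos":       1000000000,  # PriorityWeightQOS
}
# Each factor is a normalized value in [0, 1].
factors = {"age": 0.5, "fairshare": 0.25, "jobsize": 1.0,
           "partition": 1.0, "qos": 0.1}

priority = sum(int(weights[f] * factors[f]) for f in weights)
print(priority)  # 110000000
```

With weights like these, QOS dominates, fairshare comes next, and job size and partition contribute nothing, which matches the intent of the config.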
>>
>>     I'm happy to chat about any of the settings if you want, or share
>>     our full config.
>>
>>     -Paul Edmon-
>>
>>     On 5/29/19 10:17 AM, Julius, Chad wrote:
>>>
>>>     All,
>>>
>>>     We rushed our Slurm install due to a short timeframe and missed
>>>     some important items.  We are now looking to implement a better
>>>     system than the first in, first out we have now.  My question:
>>>     are the defaults listed in the slurm.conf file a good start?
>>>     Would anyone be willing to share the Scheduling section of
>>>     their .conf?  Also, we are looking to increase the maximum array
>>>     size, but I don’t see that in the slurm.conf in version 17.  Am I
>>>     looking at an upgrade of Slurm in the near future, or can I just
>>>     add MaxArraySize=somenumber?
>>>
>>>     The defaults as of 17.11.8 are:
>>>
>>>     # SCHEDULING
>>>     #SchedulerAuth=
>>>     #SchedulerPort=
>>>     #SchedulerRootFilter=
>>>     #PriorityType=priority/multifactor
>>>     #PriorityDecayHalfLife=14-0
>>>     #PriorityUsageResetPeriod=14-0
>>>     #PriorityWeightFairshare=100000
>>>     #PriorityWeightAge=1000
>>>     #PriorityWeightPartition=10000
>>>     #PriorityWeightJobSize=1000
>>>     #PriorityMaxAge=1-0
>>>
>>>     *Chad Julius*
>>>     Cyberinfrastructure Engineer Specialist
>>>     *Division of Technology & Security*
>>>     SOHO 207, Box 2231
>>>     Brookings, SD 57007
>>>     Phone: 605-688-5767
>>>     www.sdstate.edu <http://www.sdstate.edu/>
>>>
>>>