[slurm-users] Slurm Fairshare / Multifactor Priority

Paul Edmon pedmon at cfa.harvard.edu
Wed May 29 14:40:38 UTC 2019


For reference, we are running 18.08.7.
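
On the MaxArraySize question: the option does exist in 17.11, so it can be added to slurm.conf directly, no upgrade needed. A sketch with illustrative values (not a recommendation):

```
# slurm.conf fragment (illustrative values).  Array task IDs run from
# 0 to MaxArraySize-1, so MaxArraySize=10001 allows indices up to 10000.
# MaxJobCount should be raised alongside it, since every pending array
# task counts against the job limit.
MaxArraySize=10001
MaxJobCount=200000
```

If memory serves, a MaxJobCount change takes a slurmctld restart rather than just an scontrol reconfigure, so plan for that.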

-Paul Edmon-

On 5/29/19 10:39 AM, Paul Edmon wrote:
>
> Sure.  Here is what we have:
>
> ########################## Scheduling #####################################
> ### This section is specific to scheduling
>
> ### Tells the scheduler to enforce limits for all partitions
> ### that a job submits to.
> EnforcePartLimits=ALL
>
> ### Lets Slurm know that we have a jobsubmit.lua script
> JobSubmitPlugins=lua
>
> ### When a job is launched this has slurmctld send the user information
> ### instead of having AD do the lookup on the node itself.
> LaunchParameters=send_gids
>
> ### Maximum sizes for Jobs.
> MaxJobCount=200000
> MaxArraySize=10000
> DefMemPerCPU=100
>
> ### Job Timers
> CompleteWait=0
>
> ### We set EpilogMsgTime high so that epilog messages don't all pile up
> ### at one time due to forced exits, which can cause problems for the
> ### master.
> EpilogMsgTime=3000000
> InactiveLimit=0
> KillWait=30
>
> ### This only applies to the reservation time limit; the job must
> ### still obey the partition time limit.
> ResvOverRun=UNLIMITED
> MinJobAge=600
> Waittime=0
>
> ### Scheduling parameters
> ### FastSchedule=2 tells Slurm not to auto-detect the node configuration
> ### but to follow our definitions instead.  We also use setting 2 because,
> ### due to our geographic size, nodes may drop out of Slurm and then
> ### reconnect.  With setting 1 they would be set to drain when they
> ### reconnected; setting 2 lets them rejoin without issue.
> FastSchedule=2
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
>
> ### Governs the default preemption behavior
> PreemptType=preempt/partition_prio
> PreemptMode=REQUEUE
>
> ### default_queue_depth should be some multiple of partition_job_depth,
> ### ideally number_of_partitions * partition_job_depth, but typically
> ### the main loop exits prematurely if you go over about 400.  A
> ### partition_job_depth of 10 seems to work well.
> SchedulerParameters=\
> default_queue_depth=1150,\
> partition_job_depth=10,\
> max_sched_time=50,\
> bf_continue,\
> bf_interval=30,\
> bf_resolution=600,\
> bf_window=11520,\
> bf_max_job_part=0,\
> bf_max_job_user=10,\
> bf_max_job_test=10000,\
> bf_max_job_start=1000,\
> bf_ignore_newly_avail_nodes,\
> kill_invalid_depend,\
> pack_serial_at_end,\
> nohold_on_prolog_fail,\
> preempt_strict_order,\
> preempt_youngest_first,\
> max_rpc_cnt=8
>
> ################################ Fairshare ################################
> ### This section sets the fairshare calculations
>
> PriorityType=priority/multifactor
>
> ### Settings for fairshare calculation frequency and shape.
> FairShareDampeningFactor=1
> PriorityDecayHalfLife=28-0
> PriorityCalcPeriod=1
>
> ### Settings for fairshare weighting.
> PriorityMaxAge=7-0
> PriorityWeightAge=10000000
> PriorityWeightFairshare=20000000
> PriorityWeightJobSize=0
> PriorityWeightPartition=0
> PriorityWeightQOS=1000000000
>
> I'm happy to chat about any of the settings if you want, or share our 
> full config.
>
> -Paul Edmon-
>
> On 5/29/19 10:17 AM, Julius, Chad wrote:
>>
>> All,
>>
>> We rushed our Slurm install due to a short timeframe and missed some
>> important items.  We are now looking to implement a better system than
>> the first-in, first-out scheduling we have now.  My question: are the
>> defaults listed in the slurm.conf file a good start?  Would anyone be
>> willing to share the Scheduling section of their .conf?  Also, we are
>> looking to increase the maximum array size, but I don't see that in the
>> slurm.conf in version 17.  Am I looking at an upgrade of Slurm in the
>> near future, or can I just add MaxArraySize=somenumber?
>>
>> The defaults as of 17.11.8 are:
>>
>> # SCHEDULING
>>
>> #SchedulerAuth=
>>
>> #SchedulerPort=
>>
>> #SchedulerRootFilter=
>>
>> #PriorityType=priority/multifactor
>>
>> #PriorityDecayHalfLife=14-0
>>
>> #PriorityUsageResetPeriod=14-0
>>
>> #PriorityWeightFairshare=100000
>>
>> #PriorityWeightAge=1000
>>
>> #PriorityWeightPartition=10000
>>
>> #PriorityWeightJobSize=1000
>>
>> #PriorityMaxAge=1-0
>>
>> *Chad Julius*
>>
>> Cyberinfrastructure Engineer Specialist
>>
>> *Division of Technology & Security*
>>
>> SOHO 207, Box 2231
>>
>> Brookings, SD 57007
>>
>> Phone: 605-688-5767
>>
>> www.sdstate.edu <http://www.sdstate.edu/>
>>
>>
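To make the weighting concrete: a rough sketch of the documented priority/multifactor arithmetic, using the weights from our config above. This is not Slurm source code, just the published formula; each factor is normalized to 0.0-1.0, and with FairShareDampeningFactor=1 the classic fairshare factor is 2**(-usage/shares). The factor inputs below are made up for illustration.

```python
# Rough sketch of the documented priority/multifactor arithmetic.
# Factor inputs (0.0 - 1.0) are illustrative; the weights are the
# ones from the config quoted above.

def fairshare_factor(effective_usage, norm_shares, damp=1):
    """Classic fairshare: F = 2**(-usage / shares / dampening)."""
    return 2 ** (-effective_usage / norm_shares / damp)

def job_priority(age_factor, fs_factor, qos_factor,
                 weight_age=10_000_000,
                 weight_fairshare=20_000_000,
                 weight_qos=1_000_000_000):
    """Weighted sum of the normalized factors.  JobSize and Partition
    weights are 0 in this config, so those terms drop out."""
    return int(weight_age * age_factor
               + weight_fairshare * fs_factor
               + weight_qos * qos_factor)

# An association using exactly its share gets F = 0.5:
f = fairshare_factor(effective_usage=0.1, norm_shares=0.1)  # 0.5

# A job at full age (>= PriorityMaxAge), that fairshare, zero QOS factor:
print(job_priority(1.0, f, 0.0))  # 10000000*1.0 + 20000000*0.5 = 20000000
```

The QOS weight dwarfing the others is deliberate here: QOS dominates, then fairshare, then age as a tiebreaker.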

