<div dir="ltr">Hi Paul, <div><br></div><div>I'm wondering about this part in your SchedulerParameters:<br><br>### default_queue_depth should be some multiple of the partition_job_depth,<br>### ideally number_of_partitions * partition_job_depth, but typically the main<br>### loop exits prematurely if you go over about 400. A partition_job_depth of<br>### 10 seems to work well.<br><br>Do you remember if that's still the case, or if it's in relation with a reported issue? That sure sounds like something that would need to be fixed if it hasn't been already.</div><div><br></div><div>Cheers,</div><div>-- </div><div>Kilian</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, May 29, 2019 at 7:42 AM Paul Edmon <<a href="mailto:pedmon@cfa.harvard.edu">pedmon@cfa.harvard.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  
  <div bgcolor="#FFFFFF">

    <p>For reference we are running 18.08.7</p>

    <p>-Paul Edmon-<br>

    </p>

    <div class="gmail-m_-4707203304347826915moz-cite-prefix">On 5/29/19 10:39 AM, Paul Edmon wrote:<br>

    </div>

    <blockquote type="cite">

      
      <p>Sure.  Here is what we have:</p>

      <p>########################## Scheduling

        #####################################<br>

        ### This section is specific to scheduling<br>

        <br>

        ### Tells the scheduler to enforce limits for all partitions<br>

        ### that a job submits to.<br>

        EnforcePartLimits=ALL<br>

        <br>

        ### Let's slurm know that we have a jobsubmit.lua script<br>

        JobSubmitPlugins=lua<br>

        <br>

        ### When a job is launched this has slurmctld send the user

        information<br>

        ### instead of having AD do the lookup on the node itself.<br>

        LaunchParameters=send_gids<br>

        <br>

        ### Maximum sizes for Jobs.<br>

        MaxJobCount=200000<br>

        MaxArraySize=10000<br>

        DefMemPerCPU=100<br>

        <br>

        ### Job Timers<br>

        CompleteWait=0<br>

        <br>

        ### We set the EpilogMsgTime long so that Epilog Messages don't

        pile up all <br>

        ### at one time due to forced exit which can cause problems for

        the master.<br>

        EpilogMsgTime=3000000<br>

        InactiveLimit=0<br>

        KillWait=30<br>

        <br>

        ### This only applies to the reservation time limit, the job

        must still obey<br>

        ### the partition time limit.<br>

        ResvOverRun=UNLIMITED<br>

        MinJobAge=600<br>

        Waittime=0<br>

        <br>

        ### Scheduling parameters<br>

        ### FastSchedule 2 lets slurm know not to auto detect the node

        config<br>

        ### but rather follow our definition.  We also use setting 2 as

        due to our geographic<br>

        ### size nodes may drop out of slurm and then reconnect.  If we

        had 1 they would be<br>

        ### set to drain when they reconnect.  Setting it to 2 allows

        them to rejoin with out<br>

        ### issue.<br>

        FastSchedule=2<br>

        SchedulerType=sched/backfill<br>

        SelectType=select/cons_res<br>

        SelectTypeParameters=CR_Core_Memory<br>

        <br>

        ### Govern's default preemption behavior<br>

        PreemptType=preempt/partition_prio<br>

        PreemptMode=REQUEUE<br>

        <br>

        ### default_queue_depth should be some multiple of the

        partition_job_depth,<br>

        ### ideally number_of_partitions * partition_job_depth, but

        typically the main<br>

        ### loop exits prematurely if you go over about 400. A

        partition_job_depth of<br>

        ### 10 seems to work well.<br>

        SchedulerParameters=\<br>

        default_queue_depth=1150,\<br>

        partition_job_depth=10,\<br>

        max_sched_time=50,\<br>

        bf_continue,\<br>

        bf_interval=30,\<br>

        bf_resolution=600,\<br>

        bf_window=11520,\<br>

        bf_max_job_part=0,\<br>

        bf_max_job_user=10,\<br>

        bf_max_job_test=10000,\<br>

        bf_max_job_start=1000,\<br>

        bf_ignore_newly_avail_nodes,\<br>

        kill_invalid_depend,\<br>

        pack_serial_at_end,\<br>

        nohold_on_prolog_fail,\<br>

        preempt_strict_order,\<br>

        preempt_youngest_first,\<br>

        max_rpc_cnt=8<br>

        <br>

        ################################ Fairshare

        ################################<br>

        ### This section sets the fairshare calculations<br>

        <br>

        PriorityType=priority/multifactor<br>

        <br>

        ### Settings for fairshare calculation frequency and shape.<br>

        FairShareDampeningFactor=1<br>

        PriorityDecayHalfLife=28-0<br>

        PriorityCalcPeriod=1<br>

        <br>

        ### Settings for fairshare weighting.<br>

        PriorityMaxAge=7-0<br>

        PriorityWeightAge=10000000<br>

        PriorityWeightFairshare=20000000<br>

        PriorityWeightJobSize=0<br>

        PriorityWeightPartition=0<br>

        PriorityWeightQOS=1000000000</p>

      <p>I'm happy to chat about any of the settings if you want, or

        share our full config.</p>

      <p>-Paul Edmon-<br>

      </p>

      <div class="gmail-m_-4707203304347826915moz-cite-prefix">On 5/29/19 10:17 AM, Julius, Chad

        wrote:<br>

      </div>

      <blockquote type="cite">

        
        <div class="gmail-m_-4707203304347826915WordSection1">

          <p class="MsoNormal">All, <u></u><u></u></p>

          <p class="MsoNormal"><u></u> <u></u></p>

          <p class="MsoNormal">We rushed our Slurm install due to a

            short timeframe and missed some important items.  We are now

            looking to implement a better system than the first in,

            first out we have now.  My question, are the defaults listed

            in the slurm.conf file a good start?  Would anyone be

            willing to share their Scheduling section in their .conf? 

            Also we are looking to increase the maximum array size but I

            don’t see that in the slurm.conf in version 17.  Am I

            looking at an upgrade of Slurm in the near future or can I

            just add MaxArraySize=somenumber?<u></u><u></u></p>

          <p class="MsoNormal"><u></u> <u></u></p>

          <p class="MsoNormal">The defaults as of 17.11.8 are:<u></u><u></u></p>

          <p class="MsoNormal"><u></u> <u></u></p>

          <p class="MsoNormal"># SCHEDULING<u></u><u></u></p>

          <p class="MsoNormal">#SchedulerAuth=<u></u><u></u></p>

          <p class="MsoNormal">#SchedulerPort=<u></u><u></u></p>

          <p class="MsoNormal">#SchedulerRootFilter=<u></u><u></u></p>

          <p class="MsoNormal">#PriorityType=priority/multifactor<u></u><u></u></p>

          <p class="MsoNormal">#PriorityDecayHalfLife=14-0<u></u><u></u></p>

          <p class="MsoNormal">#PriorityUsageResetPeriod=14-0<u></u><u></u></p>

          <p class="MsoNormal">#PriorityWeightFairshare=100000<u></u><u></u></p>

          <p class="MsoNormal">#PriorityWeightAge=1000<u></u><u></u></p>

          <p class="MsoNormal">#PriorityWeightPartition=10000<u></u><u></u></p>

          <p class="MsoNormal">#PriorityWeightJobSize=1000<u></u><u></u></p>

          <p class="MsoNormal">#PriorityMaxAge=1-0<u></u><u></u></p>

          <p class="MsoNormal"><u></u> <u></u></p>

          <p class="MsoNormal"><b>Chad Julius</b><u></u><u></u></p>

          <p class="MsoNormal">Cyberinfrastructure Engineer Specialist<u></u><u></u></p>

          <p class="MsoNormal"><u></u> <u></u></p>

          <p class="MsoNormal"><b>Division of Technology & Security</b><u></u><u></u></p>

          <p class="MsoNormal">SOHO 207, Box 2231<u></u><u></u></p>

          <p class="MsoNormal">Brookings, SD 57007<u></u><u></u></p>

          <p class="MsoNormal">Phone: 605-688-5767<u></u><u></u></p>

          <p class="MsoNormal"> <u></u><u></u></p>

          <p class="MsoNormal"><a href="http://www.sdstate.edu/" target="_blank"><span style="color:rgb(5,99,193)">www.sdstate.edu</span></a><u></u><u></u></p>

          <p class="MsoNormal"><img style="width: 2.6041in; height: 0.75in;" id="gmail-m_-4707203304347826915Picture_x0020_1" src="cid:16b041d422d4cff311" alt="cid:image007.png@01D24AF4.6CEECA30" width="250" height="72" border="0"><u></u><u></u></p>

          <p class="MsoNormal"><u></u> <u></u></p>

        </div>

      </blockquote>

    </blockquote>

  </div>


</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature">Kilian</div>