<div dir="auto">Thanks Denial for detailed  Description<div dir="auto"><br></div><div dir="auto">Regards </div><div dir="auto">Navin </div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, May 3, 2020, 13:35 Daniel Letai <<a href="mailto:dani@letai.org.il">dani@letai.org.il</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
    
  
  <div>
    <br>
    <div>On 29/04/2020 12:00:13, navin
      srivastava wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">Thanks Daniel.
        <div> </div>
        <div>All jobs went into the run state, so I'm unable to provide the
          details, but I will definitely reach out later if we see a similar
          issue.</div>
        <div><br>
        </div>
        <div>I am more interested in understanding FIFO combined with Fair
          Tree. It would be good if anybody could provide some insight on this
          combination, and also how the behaviour will change if we enable
          backfilling here.</div>
        <div><br>
        </div>
        <div>What is the role of Fair Tree here?<br>
        </div>
      </div>
    </blockquote>
    <p>Fair Tree is the algorithm used to calculate the interim fairshare
      priority, before the weight is applied, but I think after the half-life
      decay.</p>
    <p><br>
    </p>
    <p>To make it simple - FIFO without fairshare would assign priority
      based only on submission time. With fairshare, that naive priority
      is adjusted based on prior usage by the applicable entities
      (users/departments - accounts).</p>
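    <p>A rough sketch of how that adjustment enters the multifactor priority
      (simplified for illustration - the full formula in the Slurm
      documentation also includes age, job size, partition and QOS factors):</p>
    <pre># fairshare_factor is a value between 0 and 1 per user/account
# Job_priority ~= PriorityWeightFairshare * fairshare_factor + (other weighted factors)
# e.g. with PriorityWeightFairshare=500000, as in your config,
# a fairshare_factor of 0.5 contributes roughly 250000 to the job priority.</pre>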
    <p><br>
    </p>
    <p>Backfill will let you utilize your resources better, since it
      allows "inserting" lower-priority jobs ahead of higher-priority
      jobs, provided all jobs have defined wall times and no inserted
      job affects the start time of a higher-priority job in any way.
      This lets the scheduler use the "holes" that open up while it waits
      for resources to free up in order to start some large job.</p>
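    <p>This only works if jobs carry a wall time the scheduler can trust.
      An illustrative submission (script name and limits are made up):</p>
    <pre>sbatch --time=02:00:00 --ntasks=4 my_job.sh</pre>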
    <p><br>
    </p>
    <p>Suppose the system is at 60% utilization of cores, and the next
      FIFO job requires 42% - it will wait until 2% more are free so it can
      begin, meanwhile not allowing any other job to start, even one that
      would take only 30% of the resources (which are currently free) and
      would finish before that extra 2% is freed anyway.</p>
    <p>Backfill would allow such a job to start, as long as its wall time
      ensures it finishes before the 42% job would have started.</p>
    <p><br>
    </p>
    <p>Fair Tree in either case (FIFO or backfill) calculates the
      priority for each job the same way - if an account has used more
      resources recently (the half-life decay factor), its jobs get a
      lower priority even if they were submitted earlier than a job
      from an account that didn't use any resources recently.</p>
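    <p>You can inspect the fairshare values and each pending job's priority
      breakdown with standard Slurm commands (exact output columns depend on
      your version):</p>
    <pre>sshare -a   # fairshare factor per account/user
sprio -l    # per-job priority broken down by factor, incl. FAIRSHARE</pre>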
    <p><br>
    </p>
    <p>As can be expected, backfill has to loop over all jobs in the
      queue to see whether any job can fit out of order. On very
      busy/active systems that can lead to poor response times unless it is
      tuned correctly in slurm.conf - look at SchedulerParameters, all
      parameters starting with bf_, in particular bf_max_job_test=,
      bf_max_time= and bf_continue (but bf_window= can also have some
      impact if set too high).<br>
    </p>
    <p>See the man page at
      <a href="https://slurm.schedmd.com/slurm.conf.html#OPT_SchedulerParameters" target="_blank" rel="noreferrer">https://slurm.schedmd.com/slurm.conf.html#OPT_SchedulerParameters</a><br>
    </p>
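    <p>A slurm.conf sketch as a starting point (the values are only
      illustrative, not recommendations - tune them for your queue depth
      and acceptable scheduler latency):</p>
    <pre>SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_test=500,bf_max_time=300,bf_continue,bf_window=2880</pre>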
    <blockquote type="cite">
      <div dir="ltr">
        <div><br>
        </div>
        <div>PriorityType=priority/multifactor<br>
        </div>
        <div>PriorityDecayHalfLife=2<br>
          PriorityUsageResetPeriod=DAILY<br>
          PriorityWeightFairshare=500000<br>
          PriorityFlags=FAIR_TREE<br>
        </div>
        <div><br>
        </div>
        <div>Regards<br>
        </div>
        <div>Navin.</div>
        <div><br>
        </div>
        <div><br>
        </div>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Mon, Apr 27, 2020 at 9:37
          PM Daniel Letai <<a href="mailto:dani@letai.org.il" target="_blank" rel="noreferrer">dani@letai.org.il</a>> wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div>
            <p>Are you sure there are enough resources available? The
              node is in mixed state, so it's configured for both
              partitions - it's possible that earlier lower priority
              jobs are already running thus blocking the later jobs,
              especially since it's fifo.</p>
            <p><br>
            </p>
            <p>It would really help if you pasted the results of:</p>
            <p>squeue</p>
            <p>sinfo</p>
            <p><br>
            </p>
            <p>As well as the exact sbatch line, so we can see how many
              resources per node are requested.<br>
            </p>
            <p><br>
            </p>
            <div>On 26/04/2020 12:00:06, navin srivastava wrote:<br>
            </div>
            <blockquote type="cite">
              <div dir="ltr">Thanks Brian,
                <div><br>
                </div>
                <div>As suggested, I went through the document, and what I
                  understood is that Fair Tree feeds the Fairshare
                  mechanism, and job scheduling should be based on
                  that.</div>
                <div><br>
                </div>
                <div>So it means job scheduling will be FIFO, but priority
                  will be decided by Fairshare; I am not sure whether the two
                  conflict here. I see that the normal jobs' priority is
                  lower than the GPUsmall priority, so if resources are
                  available in the gpusmall partition the jobs should run.
                  No job is pending due to GPU resources - the jobs do not
                  request GPU resources at all.</div>
                <div><br>
                </div>
                <div>Is there any article where I can see how
                  fairshare works and which settings should not
                  conflict with it?</div>
                <div>The documentation never says that FIFO should be
                  disabled when fair-share is applied.<br>
                </div>
                <div><br>
                </div>
                <div>Regards</div>
                <div>Navin.</div>
                <div><br>
                </div>
              </div>
              <br>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">On Sat, Apr 25, 2020
                  at 12:47 AM Brian W. Johanson <<a href="mailto:bjohanso@psc.edu" target="_blank" rel="noreferrer">bjohanso@psc.edu</a>>
                  wrote:<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                  <div> <br>
                    If you haven't looked at the man page for
                    slurm.conf, it will answer most if not all of your
                    questions: <br>
                    <a href="https://slurm.schedmd.com/slurm.conf.html" target="_blank" rel="noreferrer">https://slurm.schedmd.com/slurm.conf.html</a>
                    - but I would rely on the manual version that was
                    distributed with the Slurm version you have installed,
                    as options do change.<br>
                    <br>
                    There is a ton of information that is tedious to get
                    through but reading through it multiple times opens
                    many doors.<br>
                    <br>
                    DefaultTime is listed in there as a Partition
                    option. <br>
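                    An illustrative partition line (values are made up -
                    adapt names and times to your site):<br>
                    <pre>PartitionName=GPUsmall Nodes=node[18-19] DefaultTime=04:00:00 State=UP</pre>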
                    If you are scheduling gres/gpu resources, it's quite
                    possible there are cores available with no
                    corresponding gpus avail.<br>
                    <br>
                    -b<br>
                    <br>
                    <div>On 4/24/20 2:49 PM, navin srivastava wrote:<br>
                    </div>
                    <blockquote type="cite">
                      <div dir="auto">Thanks Brian. 
                        <div dir="auto"><br>
                        </div>
                        <div dir="auto">I need  to check the jobs
                          order. <br>
                          <div dir="auto"><br>
                          </div>
                          <div dir="auto">Is there  any way to define
                            the default timeline of the job if user  not
                            specifying time limit. </div>
                          <div dir="auto"><br>
                          </div>
                          <div dir="auto">Also what does the meaning of
                            fairtree  in priorities in slurm.Conf file. </div>
                          <div dir="auto"><br>
                          </div>
                          <div dir="auto">The set of nodes are different
                            in partitions.FIFO  does  not care for any 
                            partitiong. </div>
                          <div dir="auto">Is it like strict odering
                            means the job came 1st will go and until  it
                            runs it will  not allow others.</div>
                          <div dir="auto"><br>
                          </div>
                          <div dir="auto">Also priorities is high for
                            gpusmall partition and low for normal jobs
                            and the nodes of the normal partition is
                            full but gpusmall cores are available.</div>
                          <div dir="auto"><br>
                          </div>
                          <div dir="auto">Regards <br>
                          </div>
                          <div dir="auto">Navin </div>
                        </div>
                      </div>
                      <br>
                      <div class="gmail_quote">
                        <div dir="ltr" class="gmail_attr">On Fri, Apr
                          24, 2020, 23:49 Brian W. Johanson <<a href="mailto:bjohanso@psc.edu" target="_blank" rel="noreferrer">bjohanso@psc.edu</a>>
                          wrote:<br>
                        </div>
                        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                          <div> <tt>Without seeing the jobs in your
                              queue, I would expect the next job in FIFO
                              order to be too large to fit in the
                              current idle resources. <br>
                              <br>
                              Configure it to use the backfill
                              scheduler: </tt><tt><tt>SchedulerType=sched/backfill<br>
                                <br>
                              </tt>      SchedulerType<br>
                                            Identifies  the type of
                              scheduler to be used.  Note the slurmctld
                              daemon must be restarted for a change in
                              scheduler type to become effective
                              (reconfiguring a running daemon has no
                              effect for this parameter).  The scontrol
                              command can be used to manually change job
                              priorities if desired.  Acceptable values
                              include:<br>
                              <br>
                                            sched/backfill<br>
                                                   For a backfill
                              scheduling module to augment the default
                              FIFO scheduling.  Backfill scheduling will
                              initiate lower-priority jobs if doing so
                              does not delay the expected initiation
                              time of any  higher  priority  job.  
                              Effectiveness  of  backfill scheduling is
                              dependent upon users specifying job time
                              limits, otherwise all jobs will have the
                              same time limit and backfilling is
                              impossible.  Note documentation for the
                              SchedulerParameters option above.  This is
                              the default configuration.<br>
                              <br>
                                            sched/builtin<br>
                                                   This  is  the  FIFO
                              scheduler which initiates jobs in priority
                              order.  If any job in the partition can
                              not be scheduled, no lower priority job in
                              that partition will be scheduled.  An
                              exception is made for jobs that can not
                              run due to partition constraints (e.g. the
                              time limit) or down/drained nodes.  In
                              that case, lower priority jobs can be
                              initiated and not impact the higher
                              priority job.<br>
                              <br>
                              <br>
                              <br>
                              Your partitions are set with
                              maxtime=INFINITE; if your users are not
                              specifying a reasonable time limit for their
                              jobs, this won't help either.<br>
                              <br>
                              <br>
                              -b<br>
                              <br>
                            </tt><br>
                            <div>On 4/24/20 1:52 PM, navin srivastava
                              wrote:<br>
                            </div>
                            <blockquote type="cite">
                              <div dir="ltr">In addition to the above
                                when i see the sprio of both the jobs it
                                says :-
                                <div><br>
                                </div>
                                <div>for the normal queue, all jobs
                                  show the same priority:</div>
                                <div><br>
                                </div>
                                <div>  JOBID PARTITION  PRIORITY  FAIRSHARE<br>
                                      1291352 normal       15789      15789<br>
                                </div>
                                <div><br>
                                </div>
                                <div>for GPUsmall, all jobs show the
                                  same priority:</div>
                                <div><br>
                                </div>
                                <div>  JOBID PARTITION  PRIORITY  FAIRSHARE<br>
                                      1291339 GPUsmall     21052      21053<br>
                                </div>
                              </div>
                              <br>
                              <div class="gmail_quote">
                                <div dir="ltr" class="gmail_attr">On
                                  Fri, Apr 24, 2020 at 11:14 PM navin
                                  srivastava <<a href="mailto:navin.altair@gmail.com" rel="noreferrer noreferrer" target="_blank">navin.altair@gmail.com</a>>
                                  wrote:<br>
                                </div>
                                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                  <div dir="ltr">Hi Team,<br>
                                    <div><br>
                                    </div>
                                    <div>We are facing an issue in our
                                      environment: resources are
                                      free, but jobs go into the
                                      queued state and do not run.</div>
                                    <div><br>
                                    </div>
                                    <div>I have attached the
                                      slurm.conf file here.</div>
                                    <div><br>
                                    </div>
                                    <div>scenario:-</div>
                                    <div><br>
                                    </div>
                                    <div>There are jobs only in 2
                                      partitions:</div>
                                    <div> 344 jobs are in PD state in the
                                      normal partition; the nodes
                                      belonging to the normal partition
                                      are full and no more jobs can run there.</div>
                                    <div><br>
                                    </div>
                                    <div>1300 jobs in the GPUsmall
                                      partition are queued and enough
                                      CPU is available to execute the
                                      jobs, but I see the jobs are not
                                      being scheduled on the free nodes.</div>
                                    <div><br>
                                    </div>
                                    <div>There are no pending jobs in
                                      any other partition.</div>
                                    <div>eg:-</div>
                                    <div>node status:- node18</div>
                                    <div><br>
                                    </div>
                                    <div>NodeName=node18 Arch=x86_64
                                      CoresPerSocket=18<br>
                                         CPUAlloc=6 CPUErr=0 CPUTot=36
                                      CPULoad=4.07<br>
                                         AvailableFeatures=K2200<br>
                                         ActiveFeatures=K2200<br>
                                         Gres=gpu:2<br>
                                         NodeAddr=node18
                                      NodeHostName=node18 Version=17.11<br>
                                         OS=Linux 4.4.140-94.42-default
                                      #1 SMP Tue Jul 17 07:44:50 UTC
                                      2018 (0b375e4)<br>
                                         RealMemory=1 AllocMem=0
                                      FreeMem=79532 Sockets=2 Boards=1<br>
                                         State=MIXED ThreadsPerCore=1
                                      TmpDisk=0 Weight=1 Owner=N/A
                                      MCS_label=N/A<br>
                                         Partitions=GPUsmall,pm_shared<br>
                                         BootTime=2019-12-10T14:16:37
                                      SlurmdStartTime=2019-12-10T14:24:08<br>
                                       
                                       CfgTRES=cpu=36,mem=1M,billing=36<br>
                                         AllocTRES=cpu=6<br>
                                         CapWatts=n/a<br>
                                         CurrentWatts=0 LowestJoules=0
                                      ConsumedJoules=0<br>
                                         ExtSensorsJoules=n/s
                                      ExtSensorsWatts=0
                                      ExtSensorsTemp=n/s<br>
                                    </div>
                                    <div><br>
                                    </div>
                                    <div>node19:-</div>
                                    <div><br>
                                    </div>
                                    <div>NodeName=node19 Arch=x86_64
                                      CoresPerSocket=18<br>
                                         CPUAlloc=16 CPUErr=0 CPUTot=36
                                      CPULoad=15.43<br>
                                         AvailableFeatures=K2200<br>
                                         ActiveFeatures=K2200<br>
                                         Gres=gpu:2<br>
                                         NodeAddr=node19
                                      NodeHostName=node19 Version=17.11<br>
                                         OS=Linux 4.12.14-94.41-default
                                      #1 SMP Wed Oct 31 12:25:04 UTC
                                      2018 (3090901)<br>
                                         RealMemory=1 AllocMem=0
                                      FreeMem=63998 Sockets=2 Boards=1<br>
                                         State=MIXED ThreadsPerCore=1
                                      TmpDisk=0 Weight=1 Owner=N/A
                                      MCS_label=N/A<br>
                                         Partitions=GPUsmall,pm_shared<br>
                                         BootTime=2020-03-12T06:51:54
                                      SlurmdStartTime=2020-03-12T06:53:14<br>
                                       
                                       CfgTRES=cpu=36,mem=1M,billing=36<br>
                                         AllocTRES=cpu=16<br>
                                         CapWatts=n/a<br>
                                         CurrentWatts=0 LowestJoules=0
                                      ConsumedJoules=0<br>
                                         ExtSensorsJoules=n/s
                                      ExtSensorsWatts=0
                                      ExtSensorsTemp=n/s<br>
                                    </div>
                                    <div><br>
                                    </div>
                                    <div>Could you please help me
                                      understand what the reason could
                                      be?</div>
                                    <div><br>
                                    </div>
                                  </div>
                                </blockquote>
                              </div>
                            </blockquote>
                            <br>
                          </div>
                        </blockquote>
                      </div>
                    </blockquote>
                    <br>
                  </div>
                </blockquote>
              </div>
            </blockquote>
            <pre cols="72">-- 
Regards,

Daniel Letai
+972 (0)505 870 456</pre>
          </div>
        </blockquote>
      </div>
    </blockquote>
    <pre cols="72">-- 
Regards,

Daniel Letai
+972 (0)505 870 456</pre>
  </div>


</blockquote></div>