<div dir="auto">Thanks Denial for detailed  Description<div dir="auto"><br></div><div dir="auto">Regards </div><div dir="auto">Navin </div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, May 3, 2020, 13:35 Daniel Letai <<a href="mailto:dani@letai.org.il">dani@letai.org.il</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
    
  
  <div>
    <br>
    <div>On 29/04/2020 12:00:13, navin
      srivastava wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">Thanks Daniel.
        <div> </div>
        <div>All jobs went into the run state, so I'm unable to provide the
          details, but I will definitely reach out later if we see a similar
          issue.</div>
        <div><br>
        </div>
        <div>I am more interested in understanding FIFO combined with Fair
          Tree. It would be good if anybody could provide some insight on this
          combination, and also how the behaviour will change if we enable
          backfilling here.</div>
        <div><br>
        </div>
        <div>What is the role of Fair Tree here?<br>
        </div>
      </div>
    </blockquote>
    <p>Fair Tree is the algorithm used to calculate the interim fairshare
      priority, before the weight is applied, but I think after the half-life
      decay.</p>
    <p><br>
    </p>
    <p>To make it simple - FIFO without fairshare would assign priority
      based only on submission time. With fairshare, that naive priority
      is adjusted based on prior usage by the applicable entities
      (users/departments - accounts).</p>
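    <p>A rough sketch of how that adjustment enters the multifactor priority
      (simplified for illustration - the full formula in the Slurm
      documentation also includes age, job size, partition and QOS factors):</p>
    <pre># fairshare_factor is a value between 0 and 1 per user/account
# Job_priority ~= PriorityWeightFairshare * fairshare_factor + (other weighted factors)
# e.g. with PriorityWeightFairshare=500000, as in your config,
# a fairshare_factor of 0.5 contributes roughly 250000 to the job priority.</pre>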
    <p><br>
    </p>
    <p>Backfill will let you utilize your resources better, since it
      allows "inserting" lower-priority jobs ahead of higher-priority
      jobs, provided all jobs have defined wall times and no inserted
      job affects the start time of a higher-priority job in any way.
      This lets the scheduler use the "holes" that open up while it waits
      for resources to free up in order to start some large job.</p>
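    <p>This only works if jobs carry a wall time the scheduler can trust.
      An illustrative submission (script name and limits are made up):</p>
    <pre>sbatch --time=02:00:00 --ntasks=4 my_job.sh</pre>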
    <p><br>
    </p>
    <p>Suppose the system is at 60% utilization of cores, and the next
      FIFO job requires 42% - it will wait until 2% more are free so it can
      begin, meanwhile not allowing any other job to start, even one that
      would take only 30% of the resources (which are currently free) and
      would finish before that extra 2% is freed anyway.</p>
    <p>Backfill would allow such a job to start, as long as its wall time
      ensures it finishes before the 42% job would have started.</p>
    <p><br>
    </p>
    <p>Fair Tree in either case (FIFO or backfill) calculates the
      priority for each job the same way - if an account has used more
      resources recently (the half-life decay factor), its jobs get a
      lower priority even if they were submitted earlier than a job
      from an account that didn't use any resources recently.</p>
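    <p>You can inspect the fairshare values and each pending job's priority
      breakdown with standard Slurm commands (exact output columns depend on
      your version):</p>
    <pre>sshare -a   # fairshare factor per account/user
sprio -l    # per-job priority broken down by factor, incl. FAIRSHARE</pre>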
    <p><br>
    </p>
    <p>As can be expected, backfill has to loop over all jobs in the
      queue to see whether any job can fit out of order. On very
      busy/active systems that can lead to poor response times unless it is
      tuned correctly in slurm.conf - look at SchedulerParameters, all
      parameters starting with bf_, in particular bf_max_job_test=,
      bf_max_time= and bf_continue (but bf_window= can also have some
      impact if set too high).<br>
    </p>
    <p>See the man page at
      <a href="https://slurm.schedmd.com/slurm.conf.html#OPT_SchedulerParameters" target="_blank" rel="noreferrer">https://slurm.schedmd.com/slurm.conf.html#OPT_SchedulerParameters</a><br>
    </p>
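    <p>A slurm.conf sketch as a starting point (the values are only
      illustrative, not recommendations - tune them for your queue depth
      and acceptable scheduler latency):</p>
    <pre>SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_test=500,bf_max_time=300,bf_continue,bf_window=2880</pre>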
    <blockquote type="cite">
      <div dir="ltr">
        <div><br>
        </div>
        <div>PriorityType=priority/multifactor<br>
        </div>
        <div>PriorityDecayHalfLife=2<br>
          PriorityUsageResetPeriod=DAILY<br>
          PriorityWeightFairshare=500000<br>
          PriorityFlags=FAIR_TREE<br>
        </div>
        <div><br>
        </div>
        <div>Regards<br>
        </div>
        <div>Navin.</div>
        <div><br>
        </div>
        <div><br>
        </div>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Mon, Apr 27, 2020 at 9:37
          PM Daniel Letai <<a href="mailto:dani@letai.org.il" target="_blank" rel="noreferrer">dani@letai.org.il</a>> wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div>
            <p>Are you sure there are enough resources available? The
              node is in mixed state, so it's configured for both
              partitions - it's possible that earlier lower priority
              jobs are already running thus blocking the later jobs,
              especially since it's fifo.</p>
            <p><br>
            </p>
            <p>It would really help if you pasted the results of:</p>
            <p>squeue</p>
            <p>sinfo</p>
            <p><br>
            </p>
            <p>As well as the exact sbatch line, so we can see how many
              resources per node are requested.<br>
            </p>
            <p><br>
            </p>
            <div>On 26/04/2020 12:00:06, navin srivastava wrote:<br>
            </div>
            <blockquote type="cite">
              <div dir="ltr">Thanks Brian,
                <div><br>
                </div>
                <div>As suggested, I went through the document, and what I
                  understood is that Fair Tree feeds the Fairshare
                  mechanism, and job scheduling should be based on
                  that.</div>
                <div><br>
                </div>
                <div>So it means job scheduling will be FIFO, but priority
                  will be decided by Fairshare; I am not sure whether the two
                  conflict here. I see that the normal jobs' priority is
                  lower than the GPUsmall priority, so if resources are
                  available in the gpusmall partition the jobs should run.
                  No job is pending due to GPU resources - the jobs do not
                  request GPU resources at all.</div>
                <div><br>
                </div>
                <div>Is there any article where I can see how
                  fairshare works and which settings should not
                  conflict with it?</div>
                <div>The documentation never says that FIFO should be
                  disabled when fair-share is applied.<br>
                </div>
                <div><br>
                </div>
                <div>Regards</div>
                <div>Navin.</div>
                <div><br>
                </div>
              </div>
              <br>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">On Sat, Apr 25, 2020
                  at 12:47 AM Brian W. Johanson <<a href="mailto:bjohanso@psc.edu" target="_blank" rel="noreferrer">bjohanso@psc.edu</a>>
                  wrote:<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                  <div> <br>
                    If you haven't looked at the man page for
                    slurm.conf, it will answer most if not all of your
                    questions: <br>
                    <a href="https://slurm.schedmd.com/slurm.conf.html" target="_blank" rel="noreferrer">https://slurm.schedmd.com/slurm.conf.html</a>
                    - but I would rely on the manual version that was
                    distributed with the Slurm version you have installed,
                    as options do change.<br>
                    <br>
                    There is a ton of information that is tedious to get
                    through but reading through it multiple times opens
                    many doors.<br>
                    <br>
                    DefaultTime is listed in there as a Partition
                    option. <br>
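                    An illustrative partition line (values are made up -
                    adapt names and times to your site):<br>
                    <pre>PartitionName=GPUsmall Nodes=node[18-19] DefaultTime=04:00:00 State=UP</pre>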
                    If you are scheduling gres/gpu resources, it's quite
                    possible there are cores available with no
                    corresponding gpus avail.<br>
                    <br>
                    -b<br>
                    <br>
                    <div>On 4/24/20 2:49 PM, navin srivastava wrote:<br>
                    </div>
                    <blockquote type="cite">
                      <div dir="auto">Thanks Brian. 
                        <div dir="auto"><br>
                        </div>
                        <div dir="auto">I need  to check the jobs
                          order. <br>
                          <div dir="auto"><br>
                          </div>
                          <div dir="auto">Is there  any way to define
                            the default timeline of the job if user  not
                            specifying time limit. </div>
                          <div dir="auto"><br>
                          </div>
                          <div dir="auto">Also what does the meaning of
                            fairtree  in priorities in slurm.Conf file. </div>
                          <div dir="auto"><br>
                          </div>
                          <div dir="auto">The set of nodes are different
                            in partitions.FIFO  does  not care for any 
                            partitiong. </div>
                          <div dir="auto">Is it like strict odering
                            means the job came 1st will go and until  it
                            runs it will  not allow others.</div>
                          <div dir="auto"><br>
                          </div>
                          <div dir="auto">Also priorities is high for
                            gpusmall partition and low for normal jobs
                            and the nodes of the normal partition is
                            full but gpusmall cores are available.</div>
                          <div dir="auto"><br>
                          </div>
                          <div dir="auto">Regards <br>
                          </div>
                          <div dir="auto">Navin </div>
                        </div>
                      </div>
                      <br>
                      <div class="gmail_quote">
                        <div dir="ltr" class="gmail_attr">On Fri, Apr
                          24, 2020, 23:49 Brian W. Johanson <<a href="mailto:bjohanso@psc.edu" target="_blank" rel="noreferrer">bjohanso@psc.edu</a>>
                          wrote:<br>
                        </div>
                        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                          <div> <tt>Without seeing the jobs in your
                              queue, I would expect the next job in FIFO
                              order to be too large to fit in the
                              current idle resources. <br>
                              <br>
                              Configure it to use the backfill
                              scheduler: </tt><tt><tt>SchedulerType=sched/backfill<br>
                                <br>
                              </tt>      SchedulerType<br>
                                            Identifies  the type of
                              scheduler to be used.  Note the slurmctld
                              daemon must be restarted for a change in
                              scheduler type to become effective
                              (reconfiguring a running daemon has no
                              effect for this parameter).  The scontrol
                              command can be used to manually change job
                              priorities if desired.  Acceptable values
                              include:<br>
                              <br>
                                            sched/backfill<br>
                                                   For a backfill
                              scheduling module to augment the default
                              FIFO scheduling.  Backfill scheduling will
                              initiate lower-priority jobs if doing so
                              does not delay the expected initiation
                              time of any  higher  priority  job.  
                              Effectiveness  of  backfill scheduling is
                              dependent upon users specifying job time
                              limits, otherwise all jobs will have the
                              same time limit and backfilling is
                              impossible.  Note documentation for the
                              SchedulerParameters option above.  This is
                              the default configuration.<br>
                              <br>
                                            sched/builtin<br>
                                                   This  is  the  FIFO
                              scheduler which initiates jobs in priority
                              order.  If any job in the partition can
                              not be scheduled, no lower priority job in
                              that partition will be scheduled.  An
                              exception is made for jobs that can not
                              run due to partition constraints (e.g. the
                              time limit) or down/drained nodes.  In
                              that case, lower priority jobs can be
                              initiated and not impact the higher
                              priority job.<br>
                              <br>
                              <br>
                              <br>
                              Your partitions are set with
                              maxtime=INFINITE; if your users are not
                              specifying a reasonable time limit for their
                              jobs, this won't help either.<br>
                              <br>
                              <br>
                              -b<br>
                              <br>
                            </tt><br>
                            <div>On 4/24/20 1:52 PM, navin srivastava
                              wrote:<br>
                            </div>
                            <blockquote type="cite">
                              <div dir="ltr">In addition to the above
                                when i see the sprio of both the jobs it
                                says :-
                                <div><br>
                                </div>
                                <div>for the normal queue, all jobs
                                  show the same priority:</div>
                                <div><br>
                                </div>
                                <div>  JOBID PARTITION  PRIORITY  FAIRSHARE<br>
                                      1291352 normal       15789      15789<br>
                                </div>
                                <div><br>
                                </div>
                                <div>for GPUsmall, all jobs show the
                                  same priority:</div>
                                <div><br>
                                </div>
                                <div>  JOBID PARTITION  PRIORITY  FAIRSHARE<br>
                                      1291339 GPUsmall     21052      21053<br>
                                </div>
                              </div>
                              <br>
                              <div class="gmail_quote">
                                <div dir="ltr" class="gmail_attr">On
                                  Fri, Apr 24, 2020 at 11:14 PM navin
                                  srivastava <<a href="mailto:navin.altair@gmail.com" rel="noreferrer noreferrer" target="_blank">navin.altair@gmail.com</a>>
                                  wrote:<br>
                                </div>
                                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                  <div dir="ltr">Hi Team,<br>
                                    <div><br>
                                    </div>
                                    <div>We are facing an issue in our
                                      environment: resources are
                                      free, but jobs go into the
                                      queued state and do not run.</div>
                                    <div><br>
                                    </div>
                                    <div>I have attached the
                                      slurm.conf file here.</div>
                                    <div><br>
                                    </div>
                                    <div>scenario:-</div>
                                    <div><br>
                                    </div>
                                    <div>There are jobs only in 2
                                      partitions:</div>
                                    <div> 344 jobs are in PD state in the
                                      normal partition; the nodes
                                      belonging to the normal partition
                                      are full and no more jobs can run there.</div>
                                    <div><br>
                                    </div>
                                    <div>1300 jobs in the GPUsmall
                                      partition are queued and enough
                                      CPU is available to execute the
                                      jobs, but I see the jobs are not
                                      being scheduled on the free nodes.</div>
                                    <div><br>
                                    </div>
                                    <div>There are no pending jobs in
                                      any other partition.</div>
                                    <div>eg:-</div>
                                    <div>node status:- node18</div>
                                    <div><br>
                                    </div>
                                    <div>NodeName=node18 Arch=x86_64
                                      CoresPerSocket=18<br>
                                         CPUAlloc=6 CPUErr=0 CPUTot=36
                                      CPULoad=4.07<br>
                                         AvailableFeatures=K2200<br>
                                         ActiveFeatures=K2200<br>
                                         Gres=gpu:2<br>
                                         NodeAddr=node18
                                      NodeHostName=node18 Version=17.11<br>
                                         OS=Linux 4.4.140-94.42-default
                                      #1 SMP Tue Jul 17 07:44:50 UTC
                                      2018 (0b375e4)<br>
                                         RealMemory=1 AllocMem=0
                                      FreeMem=79532 Sockets=2 Boards=1<br>
                                         State=MIXED ThreadsPerCore=1
                                      TmpDisk=0 Weight=1 Owner=N/A
                                      MCS_label=N/A<br>
                                         Partitions=GPUsmall,pm_shared<br>
                                         BootTime=2019-12-10T14:16:37
                                      SlurmdStartTime=2019-12-10T14:24:08<br>
                                       
                                       CfgTRES=cpu=36,mem=1M,billing=36<br>
                                         AllocTRES=cpu=6<br>
                                         CapWatts=n/a<br>
                                         CurrentWatts=0 LowestJoules=0
                                      ConsumedJoules=0<br>
                                         ExtSensorsJoules=n/s
                                      ExtSensorsWatts=0
                                      ExtSensorsTemp=n/s<br>
                                    </div>
                                    <div><br>
                                    </div>
                                    <div>node19:-</div>
                                    <div><br>
                                    </div>
                                    <div>NodeName=node19 Arch=x86_64
                                      CoresPerSocket=18<br>
                                         CPUAlloc=16 CPUErr=0 CPUTot=36
                                      CPULoad=15.43<br>
                                         AvailableFeatures=K2200<br>
                                         ActiveFeatures=K2200<br>
                                         Gres=gpu:2<br>
                                         NodeAddr=node19
                                      NodeHostName=node19 Version=17.11<br>
                                         OS=Linux 4.12.14-94.41-default
                                      #1 SMP Wed Oct 31 12:25:04 UTC
                                      2018 (3090901)<br>
                                         RealMemory=1 AllocMem=0
                                      FreeMem=63998 Sockets=2 Boards=1<br>
                                         State=MIXED ThreadsPerCore=1
                                      TmpDisk=0 Weight=1 Owner=N/A
                                      MCS_label=N/A<br>
                                         Partitions=GPUsmall,pm_shared<br>
                                         BootTime=2020-03-12T06:51:54
                                      SlurmdStartTime=2020-03-12T06:53:14<br>
                                       
                                       CfgTRES=cpu=36,mem=1M,billing=36<br>
                                         AllocTRES=cpu=16<br>
                                         CapWatts=n/a<br>
                                         CurrentWatts=0 LowestJoules=0
                                      ConsumedJoules=0<br>
                                         ExtSensorsJoules=n/s
                                      ExtSensorsWatts=0
                                      ExtSensorsTemp=n/s<br>
                                    </div>
                                    <div><br>
                                    </div>
                                    <div>Could you please help me
                                      understand what the reason could
                                      be?</div>
                                    <div><br>
                                    </div>
                                  </div>
                                </blockquote>
                              </div>
                            </blockquote>
                            <br>
                          </div>
                        </blockquote>
                      </div>
                    </blockquote>
                    <br>
                  </div>
                </blockquote>
              </div>
            </blockquote>
            <pre cols="72">-- 
Regards,

Daniel Letai
+972 (0)505 870 456</pre>
          </div>
        </blockquote>
      </div>
    </blockquote>
    <pre cols="72">-- 
Regards,

Daniel Letai
+972 (0)505 870 456</pre>
  </div>


</blockquote></div>