If you haven't looked at the man page for slurm.conf, it will answer
most, if not all, of your questions:
https://slurm.schedmd.com/slurm.conf.html
That said, I would rely on the manual version that was distributed
with the release you have installed, as options do change.

There is a ton of information that is tedious to get through, but
reading through it multiple times opens many doors.

DefaultTime is listed in there as a partition option.
If you are scheduling gres/gpu resources, it's quite possible there
are cores available with no corresponding GPUs available.
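
For example, DefaultTime can be set per partition in slurm.conf; a
minimal sketch (the two-hour value is arbitrary, and the node list is
only illustrative):

    # Jobs that omit --time get 2 hours instead of the partition maximum.
    PartitionName=GPUsmall Nodes=node[18-19] DefaultTime=02:00:00 MaxTime=INFINITE State=UP

A change like this takes effect after 'scontrol reconfigure' or a
restart of slurmctld.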

-b
    <div class="moz-cite-prefix">On 4/24/20 2:49 PM, navin srivastava
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAK8-jZAVMt8tNDGYWYssL0UkhcQ7TyuV5k0Lyh+Qd8bRx3KHRg@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="auto">Thanks Brian. 
        <div dir="auto"><br>
        </div>
        <div dir="auto">I need  to check the jobs order. <br>
          <div dir="auto"><br>
          </div>
          <div dir="auto">Is there  any way to define the default
            timeline of the job if user  not specifying time limit. </div>
          <div dir="auto"><br>
          </div>
          <div dir="auto">Also what does the meaning of fairtree  in
            priorities in slurm.Conf file. </div>
          <div dir="auto"><br>
          </div>
          <div dir="auto">The set of nodes are different in
            partitions.FIFO  does  not care for any  partitiong. </div>
          <div dir="auto">Is it like strict odering means the job came
            1st will go and until  it runs it will  not allow others.</div>
          <div dir="auto"><br>
          </div>
          <div dir="auto">Also priorities is high for gpusmall partition
            and low for normal jobs and the nodes of the normal
            partition is full but gpusmall cores are available.</div>
          <div dir="auto"><br>
          </div>
          <div dir="auto">Regards <br>
          </div>
          <div dir="auto">Navin </div>
        </div>
      </div>
>
> On Fri, Apr 24, 2020, 23:49 Brian W. Johanson <bjohanso@psc.edu> wrote:
>> Without seeing the jobs in your queue, I would expect the next job
>> in FIFO order to be too large to fit in the current idle resources.
>>
>> Configure it to use the backfill scheduler:
>>
>>     SchedulerType=sched/backfill
>>
>>     SchedulerType
>>         Identifies the type of scheduler to be used. Note the
>>         slurmctld daemon must be restarted for a change in scheduler
>>         type to become effective (reconfiguring a running daemon has
>>         no effect for this parameter). The scontrol command can be
>>         used to manually change job priorities if desired.
>>         Acceptable values include:
>>
>>         sched/backfill
>>             For a backfill scheduling module to augment the default
>>             FIFO scheduling. Backfill scheduling will initiate
>>             lower-priority jobs if doing so does not delay the
>>             expected initiation time of any higher-priority job.
>>             Effectiveness of backfill scheduling is dependent upon
>>             users specifying job time limits; otherwise all jobs
>>             will have the same time limit and backfilling is
>>             impossible. Note documentation for the
>>             SchedulerParameters option above. This is the default
>>             configuration.
>>
>>         sched/builtin
>>             This is the FIFO scheduler which initiates jobs in
>>             priority order. If any job in the partition cannot be
>>             scheduled, no lower-priority job in that partition will
>>             be scheduled. An exception is made for jobs that cannot
>>             run due to partition constraints (e.g. the time limit)
>>             or down/drained nodes. In that case, lower-priority jobs
>>             can be initiated without impacting the higher-priority
>>             job.
>>
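>> As the man page mentions, scontrol can also bump a pending job's
>> priority by hand; a minimal sketch (the job ID is one from your
>> sprio output, the priority value is arbitrary):
>>
>>     # Push one job ahead of its equal-priority FIFO peers.
>>     scontrol update JobId=1291339 Priority=100000
>>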
>> Your partitions are set with MaxTime=INFINITE; if your users are
>> not specifying a reasonable time limit for their jobs, backfill
>> won't help either.
>>
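>> The user-side fix is for jobs to request realistic walltimes, e.g.
>> (the script name here is illustrative):
>>
>>     # Ask for 2 hours instead of the partition default/maximum.
>>     sbatch --time=02:00:00 job.sh
>>
>> A partition-level DefaultTime covers the jobs that don't.
>>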
>> -b
>>
>> On 4/24/20 1:52 PM, navin srivastava wrote:
              <div dir="ltr">In addition to the above when i see the
                sprio of both the jobs it says :-
                <div><br>
                </div>
                <div>for normal queue jobs all jobs showing the same
                  priority</div>
                <div><br>
                </div>
                <div> JOBID PARTITION   PRIORITY  FAIRSHARE<br>
                          1291352 normal           15789      15789<br>
                </div>
                <div><br>
                </div>
                <div>for GPUsmall all jobs showing the same priority.</div>
                <div><br>
                </div>
                <div> JOBID PARTITION   PRIORITY  FAIRSHARE<br>
                          1291339 GPUsmall      21052      21053<br>
                </div>
              </div>
>>>
>>> On Fri, Apr 24, 2020 at 11:14 PM navin srivastava
>>> <navin.altair@gmail.com> wrote:
                  <div dir="ltr">Hi Team,<br>
                    <div><br>
                    </div>
                    <div>we are facing some issue in our environment.
                      The resources are free but job is going into the
                      QUEUE state but not running.</div>
                    <div><br>
                    </div>
                    <div>i have attached the slurm.conf file here.</div>
                    <div><br>
                    </div>
                    <div>scenario:-</div>
                    <div><br>
                    </div>
                    <div>There are job only in the 2 partitions:</div>
                    <div> 344 jobs are in PD state in normal partition
                      and the node belongs from the normal partitions
                      are full and no more job can run.</div>
                    <div><br>
                    </div>
                    <div>1300 JOBS are in GPUsmall partition are in
                      queue and enough CPU is avaiable to execute the
                      jobs but i see the jobs are not scheduling on free
                      nodes.</div>
                    <div><br>
                    </div>
                    <div>Rest there are no pend jobs in any other
                      partition .</div>
                    <div>eg:-</div>
                    <div>node status:- node18</div>
                    <div><br>
                    </div>
>>>> NodeName=node18 Arch=x86_64 CoresPerSocket=18
>>>>    CPUAlloc=6 CPUErr=0 CPUTot=36 CPULoad=4.07
>>>>    AvailableFeatures=K2200
>>>>    ActiveFeatures=K2200
>>>>    Gres=gpu:2
>>>>    NodeAddr=node18 NodeHostName=node18 Version=17.11
>>>>    OS=Linux 4.4.140-94.42-default #1 SMP Tue Jul 17 07:44:50 UTC 2018 (0b375e4)
>>>>    RealMemory=1 AllocMem=0 FreeMem=79532 Sockets=2 Boards=1
>>>>    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>>>    Partitions=GPUsmall,pm_shared
>>>>    BootTime=2019-12-10T14:16:37 SlurmdStartTime=2019-12-10T14:24:08
>>>>    CfgTRES=cpu=36,mem=1M,billing=36
>>>>    AllocTRES=cpu=6
>>>>    CapWatts=n/a
>>>>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>>
>>>> And node19:
>>>>
>>>> NodeName=node19 Arch=x86_64 CoresPerSocket=18
>>>>    CPUAlloc=16 CPUErr=0 CPUTot=36 CPULoad=15.43
>>>>    AvailableFeatures=K2200
>>>>    ActiveFeatures=K2200
>>>>    Gres=gpu:2
>>>>    NodeAddr=node19 NodeHostName=node19 Version=17.11
>>>>    OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31 12:25:04 UTC 2018 (3090901)
>>>>    RealMemory=1 AllocMem=0 FreeMem=63998 Sockets=2 Boards=1
>>>>    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>>>    Partitions=GPUsmall,pm_shared
>>>>    BootTime=2020-03-12T06:51:54 SlurmdStartTime=2020-03-12T06:53:14
>>>>    CfgTRES=cpu=36,mem=1M,billing=36
>>>>    AllocTRES=cpu=16
>>>>    CapWatts=n/a
>>>>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>>
>>>> Could you please help me understand what the reason could be?