<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <tt>Without seeing the jobs in your queue, I would expect the next

      job in FIFO order to be too large to fit in the current idle

      resources. <br>

      <br>

      Configure it to use the backfill scheduler: </tt><tt><tt>SchedulerType=sched/backfill<br>

        <br>

      </tt>      SchedulerType<br>

                    Identifies  the type of scheduler to be used.  Note

      the slurmctld daemon must be restarted for a change in scheduler

      type to become effective (reconfiguring a running daemon has no

      effect for this parameter).  The scontrol command can be used to

      manually change job priorities if desired.  Acceptable values

      include:<br>

      <br>

                    sched/backfill<br>

                           For a backfill scheduling module to augment

      the default FIFO scheduling.  Backfill scheduling will initiate

      lower-priority jobs if doing so does not delay the expected

      initiation time of any  higher  priority  job.   Effectiveness 

      of  backfill scheduling is dependent upon users specifying job

      time limits, otherwise all jobs will have the same time limit and

      backfilling is impossible.  Note documentation for the

      SchedulerParameters option above.  This is the default

      configuration.<br>

      <br>

                    sched/builtin<br>

                           This  is  the  FIFO scheduler which initiates

      jobs in priority order.  If any job in the partition can not be

      scheduled, no lower priority job in that partition will be

      scheduled.  An exception is made for jobs that can not run due to

      partition constraints (e.g. the time limit) or down/drained

      nodes.  In that case, lower priority jobs can be initiated and not

      impact the higher priority job.<br>

      <br>

      <br>

      <br>

      Your partitions are set with MaxTime=INFINITE. If your users are
      not specifying a reasonable time limit for their jobs, backfill
      won't help either.<br>
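      <br>
      A rough sketch of the two changes, assuming GPUsmall is the
      partition you care about. The DefaultTime and MaxTime values
      below are placeholders I made up, not taken from your slurm.conf,
      and "..." stands for whatever else is already on your partition
      line:<br>
      <br>
      # slurm.conf (sketch only, the time values are assumptions)<br>
      SchedulerType=sched/backfill<br>
      PartitionName=GPUsmall ... DefaultTime=01:00:00 MaxTime=2-00:00:00<br>
      <br>
      DefaultTime only applies to jobs submitted without a time limit;
      users can still request their own with sbatch --time, up to
      MaxTime. As the man page excerpt says, slurmctld has to be
      restarted for the SchedulerType change to take effect.<br>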

      <br>

      <br>

      -b<br>

      <br>

    </tt><br>

    <div class="moz-cite-prefix">On 4/24/20 1:52 PM, navin srivastava

      wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAK8-jZAe1ATfkfNesMoSHGJQrprEc3QUgzpm3pH5bv7tKY_-Ug@mail.gmail.com">


      <div dir="ltr">In addition to the above, when I check sprio for
        the jobs in both partitions, it shows:

        <div><br>

        </div>

        <div>For the normal queue, all jobs show the same priority:</div>

        <div><br>

        </div>

        <div> JOBID PARTITION   PRIORITY  FAIRSHARE<br>

                  1291352 normal           15789      15789<br>

        </div>

        <div><br>

        </div>

        <div>For GPUsmall, all jobs show the same priority:</div>

        <div><br>

        </div>

        <div> JOBID PARTITION   PRIORITY  FAIRSHARE<br>

                  1291339 GPUsmall      21052      21053<br>

        </div>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr" class="gmail_attr">On Fri, Apr 24, 2020 at 11:14

          PM navin srivastava &lt;<a
            href="mailto:navin.altair@gmail.com" moz-do-not-send="true">navin.altair@gmail.com</a>&gt;

          wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0px 0px 0px

          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

          <div dir="ltr">Hi Team,<br>

            <div><br>

            </div>

            <div>We are facing an issue in our environment. Resources
              are free, but jobs go into the queued state and do not
              run.</div>

            <div><br>

            </div>

            <div>I have attached the slurm.conf file here.</div>

            <div><br>

            </div>

            <div>Scenario:</div>

            <div><br>

            </div>

            <div>There are jobs in only two partitions:</div>

            <div> 344 jobs are in PD state in the normal partition; the
              nodes belonging to the normal partition are full and no
              more jobs can run.</div>

            <div><br>

            </div>

            <div>1300 jobs are queued in the GPUsmall partition. Enough
              CPUs are available to execute them, but the jobs are not
              being scheduled on the free nodes.</div>

            <div><br>

            </div>

            <div>There are no pending jobs in any other partition.</div>

            <div>For example:</div>

            <div>Node status for node18:</div>

            <div><br>

            </div>

            <div>NodeName=node18 Arch=x86_64 CoresPerSocket=18<br>

                 CPUAlloc=6 CPUErr=0 CPUTot=36 CPULoad=4.07<br>

                 AvailableFeatures=K2200<br>

                 ActiveFeatures=K2200<br>

                 Gres=gpu:2<br>

                 NodeAddr=node18 NodeHostName=node18 Version=17.11<br>

                 OS=Linux 4.4.140-94.42-default #1 SMP Tue Jul 17

              07:44:50 UTC 2018 (0b375e4)<br>

                 RealMemory=1 AllocMem=0 FreeMem=79532 Sockets=2

              Boards=1<br>

                 State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1

              Owner=N/A MCS_label=N/A<br>

                 Partitions=GPUsmall,pm_shared<br>

                 BootTime=2019-12-10T14:16:37

              SlurmdStartTime=2019-12-10T14:24:08<br>

                 CfgTRES=cpu=36,mem=1M,billing=36<br>

                 AllocTRES=cpu=6<br>

                 CapWatts=n/a<br>

                 CurrentWatts=0 LowestJoules=0 ConsumedJoules=0<br>

                 ExtSensorsJoules=n/s ExtSensorsWatts=0

              ExtSensorsTemp=n/s<br>

            </div>

            <div><br>

            </div>

            <div>Node status for node19:</div>

            <div><br>

            </div>

            <div>NodeName=node19 Arch=x86_64 CoresPerSocket=18<br>

                 CPUAlloc=16 CPUErr=0 CPUTot=36 CPULoad=15.43<br>

                 AvailableFeatures=K2200<br>

                 ActiveFeatures=K2200<br>

                 Gres=gpu:2<br>

                 NodeAddr=node19 NodeHostName=node19 Version=17.11<br>

                 OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31

              12:25:04 UTC 2018 (3090901)<br>

                 RealMemory=1 AllocMem=0 FreeMem=63998 Sockets=2

              Boards=1<br>

                 State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1

              Owner=N/A MCS_label=N/A<br>

                 Partitions=GPUsmall,pm_shared<br>

                 BootTime=2020-03-12T06:51:54

              SlurmdStartTime=2020-03-12T06:53:14<br>

                 CfgTRES=cpu=36,mem=1M,billing=36<br>

                 AllocTRES=cpu=16<br>

                 CapWatts=n/a<br>

                 CurrentWatts=0 LowestJoules=0 ConsumedJoules=0<br>

                 ExtSensorsJoules=n/s ExtSensorsWatts=0

              ExtSensorsTemp=n/s<br>

            </div>

            <div><br>

            </div>

            <div>Could you please help me understand what the reason
              could be?</div>

            <div><br>

            </div>


          </div>

        </blockquote>

      </div>

    </blockquote>

    <br>

  </body>

</html>