<div dir="ltr">Thanks Brian,<div><br></div><div>As suggested, I went through the documentation. What I understood is that fair tree drives the Fairshare mechanism, and that jobs should be scheduled based on it.</div><div><br></div><div>So it means job scheduling will be FIFO, but the priority will be decided by Fairshare. I am not sure whether the two conflict here. As I see it, the priority of the normal jobs is lower than the GPUsmall priority, so if resources are available in the GPUsmall partition, those jobs should run. No job is pending because of GPU resources; the jobs do not request GPU resources at all.</div><div><br></div><div>Is there any article where I can see how Fairshare works and which settings should not conflict with it?</div><div>The documentation never says that FIFO should be disabled if fair-share is applied.<br></div><div><br></div><div>Regards</div><div>Navin.</div><div><br></div><div><br></div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Apr 25, 2020 at 12:47 AM Brian W. Johanson <<a href="mailto:bjohanso@psc.edu">bjohanso@psc.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<br>
If you haven't looked at the man page for slurm.conf, it will answer
most if not all your questions. <br>
    <a href="https://slurm.schedmd.com/slurm.conf.html" target="_blank">https://slurm.schedmd.com/slurm.conf.html</a> but I would rely on
    the manual version that was distributed with the version you have
    installed, as options do change.<br>
<br>
There is a ton of information that is tedious to get through but
reading through it multiple times opens many doors.<br>
<br>
DefaultTime is listed in there as a Partition option. <br>
    If you are scheduling gres/gpu resources, it's quite possible there
    are cores available with no corresponding GPUs available.<br>
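    <br>
    A minimal sketch of how a default time limit might be set on a partition in slurm.conf. The one-hour value is only a placeholder, not taken from your configuration, and the accepted time formats should be checked against the man page for your installed version:<br>
    <pre>
# Hypothetical value: pick a default that suits your workload.
# Other options already on the PartitionName line stay as they are.
PartitionName=GPUsmall DefaultTime=01:00:00
</pre>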
<br>
-b<br>
<br>
<div>On 4/24/20 2:49 PM, navin srivastava
wrote:<br>
</div>
<blockquote type="cite">
<div dir="auto">Thanks Brian.
<div dir="auto"><br>
</div>
        <div dir="auto">I need to check the job order. <br>
<div dir="auto"><br>
</div>
          <div dir="auto">Is there any way to define a default
            time limit for a job if the user does not specify one? </div>
<div dir="auto"><br>
</div>
          <div dir="auto">Also, what does fairtree mean in the
            priority settings of the slurm.conf file? </div>
<div dir="auto"><br>
</div>
          <div dir="auto">The partitions have different sets of
            nodes. Does FIFO ignore partitions? </div>
          <div dir="auto">Is it strict ordering, meaning the job that came
            first goes first, and until it runs no other job is allowed to start?</div>
<div dir="auto"><br>
</div>
          <div dir="auto">Also, the priority is high for the gpusmall partition
            and low for normal jobs. The nodes of the normal
            partition are full, but gpusmall cores are available.</div>
<div dir="auto"><br>
</div>
<div dir="auto">Regards <br>
</div>
<div dir="auto">Navin </div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Apr 24, 2020, 23:49
Brian W. Johanson <<a href="mailto:bjohanso@psc.edu" target="_blank">bjohanso@psc.edu</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div> <tt>Without seeing the jobs in your queue, I would
expect the next job in FIFO order to be too large to fit
in the current idle resources. <br>
<br>
Configure it to use the backfill scheduler: </tt><tt><tt>SchedulerType=sched/backfill<br>
<br>
</tt> SchedulerType<br>
Identifies the type of scheduler to be
used. Note the slurmctld daemon must be restarted for a
change in scheduler type to become effective
(reconfiguring a running daemon has no effect for this
parameter). The scontrol command can be used to manually
change job priorities if desired. Acceptable values
include:<br>
<br>
sched/backfill<br>
For a backfill scheduling module to
augment the default FIFO scheduling. Backfill scheduling
will initiate lower-priority jobs if doing so does not
delay the expected initiation time of any higher
priority job. Effectiveness of backfill scheduling is
dependent upon users specifying job time limits, otherwise
all jobs will have the same time limit and backfilling is
impossible. Note documentation for the
SchedulerParameters option above. This is the default
configuration.<br>
<br>
sched/builtin<br>
This is the FIFO scheduler which
initiates jobs in priority order. If any job in the
partition can not be scheduled, no lower priority job in
that partition will be scheduled. An exception is made
for jobs that can not run due to partition constraints
(e.g. the time limit) or down/drained nodes. In that
case, lower priority jobs can be initiated and not impact
the higher priority job.<br>
<br>
<br>
<br>
              Your partitions are set with MaxTime=INFINITE; if your
              users are not specifying a reasonable time limit for their
              jobs, backfill won't help either.<br>
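              <br>
              A minimal sketch of the two slurm.conf changes together. The two-day cap is a placeholder, not a recommendation, and the rest of the partition line is left out here:<br>
              <br>
              # Enable backfill; slurmctld must be restarted for this to take effect<br>
              SchedulerType=sched/backfill<br>
              # Hypothetical finite cap replacing MaxTime=INFINITE<br>
              PartitionName=GPUsmall MaxTime=2-00:00:00<br>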
<br>
<br>
-b<br>
<br>
</tt><br>
<div>On 4/24/20 1:52 PM, navin srivastava wrote:<br>
</div>
<blockquote type="cite">
                <div dir="ltr">In addition to the above, when I look at the
                  sprio output for the jobs in both partitions, it shows:
<div><br>
</div>
                  <div>For the normal queue, all jobs show the same
                    priority:</div>
<div><br>
</div>
<div> JOBID PARTITION PRIORITY FAIRSHARE<br>
1291352 normal 15789 15789<br>
</div>
<div><br>
</div>
                  <div>For GPUsmall, all jobs show the same priority:</div>
<div><br>
</div>
<div> JOBID PARTITION PRIORITY FAIRSHARE<br>
1291339 GPUsmall 21052 21053<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Apr 24, 2020
at 11:14 PM navin srivastava <<a href="mailto:navin.altair@gmail.com" rel="noreferrer" target="_blank">navin.altair@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">Hi Team,<br>
<div><br>
</div>
                      <div>We are facing an issue in our environment.
                        Resources are free, but jobs go into the
                        queued (pending) state and do not run.</div>
<div><br>
</div>
                      <div>I have attached the slurm.conf file here.</div>
<div><br>
</div>
                      <div>Scenario:</div>
<div><br>
</div>
                      <div>There are jobs in only 2 partitions:</div>
                      <div> 344 jobs are in PD state in the normal partition;
                        the nodes belonging to the normal partition
                        are full and no more jobs can run.</div>
<div><br>
</div>
                      <div>1300 jobs in the GPUsmall partition are
                        queued and enough CPUs are available to execute
                        them, but the jobs are not being scheduled on the free
                        nodes.</div>
<div><br>
</div>
                      <div>There are no pending jobs in any other
                        partition.</div>
                      <div>For example:</div>
                      <div>Node status for node18:</div>
<div><br>
</div>
<div>NodeName=node18 Arch=x86_64 CoresPerSocket=18<br>
CPUAlloc=6 CPUErr=0 CPUTot=36 CPULoad=4.07<br>
AvailableFeatures=K2200<br>
ActiveFeatures=K2200<br>
Gres=gpu:2<br>
NodeAddr=node18 NodeHostName=node18
Version=17.11<br>
OS=Linux 4.4.140-94.42-default #1 SMP Tue Jul
17 07:44:50 UTC 2018 (0b375e4)<br>
RealMemory=1 AllocMem=0 FreeMem=79532 Sockets=2
Boards=1<br>
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1
Owner=N/A MCS_label=N/A<br>
Partitions=GPUsmall,pm_shared<br>
BootTime=2019-12-10T14:16:37
SlurmdStartTime=2019-12-10T14:24:08<br>
CfgTRES=cpu=36,mem=1M,billing=36<br>
AllocTRES=cpu=6<br>
CapWatts=n/a<br>
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0<br>
ExtSensorsJoules=n/s ExtSensorsWatts=0
ExtSensorsTemp=n/s<br>
</div>
<div><br>
</div>
                      <div>Node status for node19:</div>
<div><br>
</div>
<div>NodeName=node19 Arch=x86_64 CoresPerSocket=18<br>
CPUAlloc=16 CPUErr=0 CPUTot=36 CPULoad=15.43<br>
AvailableFeatures=K2200<br>
ActiveFeatures=K2200<br>
Gres=gpu:2<br>
NodeAddr=node19 NodeHostName=node19
Version=17.11<br>
OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct
31 12:25:04 UTC 2018 (3090901)<br>
RealMemory=1 AllocMem=0 FreeMem=63998 Sockets=2
Boards=1<br>
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1
Owner=N/A MCS_label=N/A<br>
Partitions=GPUsmall,pm_shared<br>
BootTime=2020-03-12T06:51:54
SlurmdStartTime=2020-03-12T06:53:14<br>
CfgTRES=cpu=36,mem=1M,billing=36<br>
AllocTRES=cpu=16<br>
CapWatts=n/a<br>
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0<br>
ExtSensorsJoules=n/s ExtSensorsWatts=0
ExtSensorsTemp=n/s<br>
</div>
<div><br>
</div>
                      <div>Could you please help me understand what
                        the reason could be?</div>
<div><br>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br>
</div>
</blockquote>
</div>
</blockquote>
<br>
</div>
</blockquote></div>