[slurm-users] Not allocating jobs even though resources are free

Brian W. Johanson bjohanso at psc.edu
Wed Apr 29 19:15:04 UTC 2020


Navin,
Check out 'sprio'; this will show you how the job priority changes 
with the weight changes you are making.
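
For example (the job IDs here are just the ones from your earlier mail, 
and the exact columns vary by Slurm version):

    sprio -l -j 1291339,1291352

This lists each job's priority broken down into the individual factors 
(age, fair-share, partition, QOS, ...).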
-b

On 4/29/20 5:00 AM, navin srivastava wrote:
> Thanks Daniel.
> All jobs have since gone into the run state, so I am unable to provide 
> the details now, but I will definitely reach out later if we see a 
> similar issue.
>
> I am more interested in understanding FIFO combined with Fair Tree. It 
> would be good if anybody could provide some insight on this combination, 
> and also on how the behaviour would change if we enabled backfilling here.
>
> What is the role of Fair Tree here?
>
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=2
> PriorityUsageResetPeriod=DAILY
> PriorityWeightFairshare=500000
> PriorityFlags=FAIR_TREE
>
> Regards
> Navin.
>
>
>
> On Mon, Apr 27, 2020 at 9:37 PM Daniel Letai <dani at letai.org.il 
> <mailto:dani at letai.org.il>> wrote:
>
>     Are you sure there are enough resources available? The node is in
>     the MIXED state, so it's configured for both partitions - it's
>     possible that earlier, lower-priority jobs are already running and
>     thus blocking the later jobs, especially since it's FIFO.
>
>
>     It would really help if you pasted the results of:
>
>     squeue
>
>     sinfo
>
>
>     As well as the exact sbatch line, so we can see how many resources
>     per node are requested.
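>
>     For example, something along these lines would be useful (the
>     partition name and the format strings here are just a suggestion
>     based on this thread, not your actual setup):
>
>     squeue -p GPUsmall -t PD -o "%.10i %.10P %.8u %.2t %.10M %.5C %.20R"
>     sinfo -p GPUsmall -o "%.12P %.5a %.10l %.6D %.6t %.15C"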
>
>
>     On 26/04/2020 12:00:06, navin srivastava wrote:
>>     Thanks Brian,
>>
>>     As suggested, I went through the documentation. What I understood is
>>     that Fair Tree drives the fairshare mechanism and that jobs should be
>>     scheduled based on it.
>>
>>     So it seems the job ordering is FIFO, but the priority is decided by
>>     fairshare; I am not sure whether the two conflict here. I see that
>>     the normal jobs' priority is lower than the GPUsmall priority, so if
>>     resources are available in the GPUsmall partition those jobs should
>>     run. No job is pending because of GPU resources; the jobs do not
>>     request GPU resources at all.
>>
>>     Is there any article where I can see how fairshare works and which
>>     settings conflict with it? The documentation never says that FIFO
>>     should be disabled when fair-share is applied.
>>
>>     Regards
>>     Navin.
>>
>>
>>
>>
>>
>>     On Sat, Apr 25, 2020 at 12:47 AM Brian W. Johanson
>>     <bjohanso at psc.edu <mailto:bjohanso at psc.edu>> wrote:
>>
>>
>>         If you haven't looked at the man page for slurm.conf, it will
>>         answer most if not all of your questions:
>>         https://slurm.schedmd.com/slurm.conf.html. But I would rely on
>>         the manual version that was distributed with the version you
>>         have installed, as options do change.
>>
>>         There is a ton of information that is tedious to get through,
>>         but reading through it multiple times opens many doors.
>>
>>         DefaultTime is listed in there as a partition option.
>>         If you are scheduling gres/gpu resources, it's quite possible
>>         there are cores available with no corresponding GPUs available.
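>>
>>         For example, a partition definition can carry a default limit
>>         like this (node names taken from your earlier output, the time
>>         values are purely illustrative):
>>
>>         PartitionName=GPUsmall Nodes=node[18-19] DefaultTime=04:00:00 MaxTime=7-00:00:00 State=UP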
>>
>>         -b
>>
>>         On 4/24/20 2:49 PM, navin srivastava wrote:
>>>         Thanks Brian.
>>>
>>>         I need to check the job order.
>>>
>>>         Is there any way to define a default time limit for a job if
>>>         the user does not specify one?
>>>
>>>         Also, what is the meaning of fair tree in the priority settings
>>>         in the slurm.conf file?
>>>
>>>         The sets of nodes in the partitions are different. Does FIFO
>>>         take partitions into account at all?
>>>         Is it strict ordering, meaning the job that came first will go,
>>>         and no other job is allowed to start until it runs?
>>>
>>>         Also, the priority is high for the GPUsmall partition and low
>>>         for normal jobs, and the nodes of the normal partition are
>>>         full, but GPUsmall cores are available.
>>>
>>>         Regards
>>>         Navin
>>>
>>>         On Fri, Apr 24, 2020, 23:49 Brian W. Johanson
>>>         <bjohanso at psc.edu <mailto:bjohanso at psc.edu>> wrote:
>>>
>>>             Without seeing the jobs in your queue, I would expect
>>>             the next job in FIFO order to be too large to fit on the
>>>             currently idle resources.
>>>
>>>             Configure it to use the backfill scheduler:
>>>             SchedulerType=sched/backfill
>>>
>>>                   SchedulerType
>>>                           Identifies  the type of scheduler to be
>>>             used.  Note the slurmctld daemon must be restarted for a
>>>             change in scheduler type to become effective
>>>             (reconfiguring a running daemon has no effect for this
>>>             parameter).  The scontrol command can be used to
>>>             manually change job priorities if desired.  Acceptable
>>>             values include:
>>>
>>>                           sched/backfill
>>>                                  For a backfill scheduling module to
>>>             augment the default FIFO scheduling.  Backfill
>>>             scheduling will initiate lower-priority jobs if doing so
>>>             does not delay the expected initiation time of any 
>>>             higher  priority  job. Effectiveness  of  backfill
>>>             scheduling is dependent upon users specifying job time
>>>             limits, otherwise all jobs will have the same time limit
>>>             and backfilling is impossible.  Note documentation for
>>>             the SchedulerParameters option above.  This is the
>>>             default configuration.
>>>
>>>                           sched/builtin
>>>                                  This  is  the  FIFO scheduler which
>>>             initiates jobs in priority order.  If any job in the
>>>             partition can not be scheduled, no lower priority job in
>>>             that partition will be scheduled.  An exception is made
>>>             for jobs that can not run due to partition constraints
>>>             (e.g. the time limit) or down/drained nodes.  In that
>>>             case, lower priority jobs can be initiated and not
>>>             impact the higher priority job.
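>>>
>>>             If you do switch to backfill, its behaviour can also be
>>>             tuned through SchedulerParameters. The values below are
>>>             only an illustration, not a recommendation for your site:
>>>
>>>             SchedulerType=sched/backfill
>>>             SchedulerParameters=bf_window=4320,bf_continue,bf_max_job_test=1000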
>>>
>>>
>>>
>>>             Your partitions are set with MaxTime=INFINITE; if your
>>>             users are not specifying a reasonable time limit for
>>>             their jobs, this won't help either.
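>>>
>>>             For backfill to have something to work with, users would
>>>             need to submit with an explicit limit, e.g. (the time and
>>>             script name are just placeholders):
>>>
>>>             sbatch --partition=GPUsmall --time=04:00:00 job.sh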
>>>
>>>
>>>             -b
>>>
>>>
>>>             On 4/24/20 1:52 PM, navin srivastava wrote:
>>>>             In addition to the above, when I look at sprio for both
>>>>             kinds of jobs it shows the following.
>>>>
>>>>             For the normal partition, all jobs show the same priority:
>>>>
>>>>                  JOBID PARTITION   PRIORITY  FAIRSHARE
>>>>                1291352 normal         15789      15789
>>>>
>>>>             For GPUsmall, all jobs show the same priority:
>>>>
>>>>                  JOBID PARTITION   PRIORITY  FAIRSHARE
>>>>                1291339 GPUsmall       21052      21053
>>>>
>>>>             On Fri, Apr 24, 2020 at 11:14 PM navin srivastava
>>>>             <navin.altair at gmail.com
>>>>             <mailto:navin.altair at gmail.com>> wrote:
>>>>
>>>>                 Hi Team,
>>>>
>>>>                 We are facing an issue in our environment: resources
>>>>                 are free, but jobs go into the queued (PD) state and
>>>>                 do not run.
>>>>
>>>>                 I have attached the slurm.conf file here.
>>>>
>>>>                 Scenario:
>>>>
>>>>                 There are jobs in only 2 partitions:
>>>>                 344 jobs are in the PD state in the normal partition;
>>>>                 the nodes belonging to the normal partition are full,
>>>>                 so no more jobs can run there.
>>>>
>>>>                 1300 jobs in the GPUsmall partition are queued, and
>>>>                 enough CPU is available to execute them, but the jobs
>>>>                 are not being scheduled on the free nodes.
>>>>
>>>>                 There are no pending jobs in any other partition.
>>>>
>>>>                 For example, the node status for node18:
>>>>
>>>>                 NodeName=node18 Arch=x86_64 CoresPerSocket=18
>>>>                    CPUAlloc=6 CPUErr=0 CPUTot=36 CPULoad=4.07
>>>>                    AvailableFeatures=K2200
>>>>                    ActiveFeatures=K2200
>>>>                    Gres=gpu:2
>>>>                    NodeAddr=node18 NodeHostName=node18 Version=17.11
>>>>                    OS=Linux 4.4.140-94.42-default #1 SMP Tue Jul 17 07:44:50 UTC 2018 (0b375e4)
>>>>                    RealMemory=1 AllocMem=0 FreeMem=79532 Sockets=2 Boards=1
>>>>                    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>>>                    Partitions=GPUsmall,pm_shared
>>>>                    BootTime=2019-12-10T14:16:37 SlurmdStartTime=2019-12-10T14:24:08
>>>>                    CfgTRES=cpu=36,mem=1M,billing=36
>>>>                    AllocTRES=cpu=6
>>>>                    CapWatts=n/a
>>>>                    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>>>                    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>>
>>>>                 And the node status for node19:
>>>>
>>>>                 NodeName=node19 Arch=x86_64 CoresPerSocket=18
>>>>                    CPUAlloc=16 CPUErr=0 CPUTot=36 CPULoad=15.43
>>>>                    AvailableFeatures=K2200
>>>>                    ActiveFeatures=K2200
>>>>                    Gres=gpu:2
>>>>                    NodeAddr=node19 NodeHostName=node19 Version=17.11
>>>>                    OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31 12:25:04 UTC 2018 (3090901)
>>>>                    RealMemory=1 AllocMem=0 FreeMem=63998 Sockets=2 Boards=1
>>>>                    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>>>                    Partitions=GPUsmall,pm_shared
>>>>                    BootTime=2020-03-12T06:51:54 SlurmdStartTime=2020-03-12T06:53:14
>>>>                    CfgTRES=cpu=36,mem=1M,billing=36
>>>>                    AllocTRES=cpu=16
>>>>                    CapWatts=n/a
>>>>                    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>>>                    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>>
>>>>                 could you please help me to understand what could
>>>>                 be the reason?
>>>>
>>>
>>
>     -- 
>     Regards,
>
>     Daniel Letai
>     +972 (0)505 870 456
>
