[slurm-users] not allocating jobs even resources are free

navin srivastava navin.altair at gmail.com
Mon May 4 12:11:52 UTC 2020


Thanks, Daniel, for the detailed description.

Regards
Navin

On Sun, May 3, 2020, 13:35 Daniel Letai <dani at letai.org.il> wrote:

>
> On 29/04/2020 12:00:13, navin srivastava wrote:
>
> Thanks Daniel.
>
> All jobs went into the run state, so I am unable to provide the details, but
> I will definitely reach out later if we see a similar issue.
>
> I am more interested in understanding FIFO combined with Fair Tree. It would
> be good if somebody could provide some insight on this combination, and also
> on how the behaviour will change if we enable backfilling here.
>
> What is the role of Fair Tree here?
>
> Fair Tree is the algorithm used to calculate the interim priority, before
> applying the weight, but I think after the half-life decay.
>
>
> To make it simple - FIFO without fairshare would assign priority based
> only on submission time. With fairshare, that naive priority is adjusted
> based on prior usage by the applicable entities (users/departments -
> accounts).
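>
> (Roughly, the multifactor plugin combines the factors as a weighted sum -
> this is a simplified sketch of the formula from the Slurm priority docs,
> not your exact configuration:
>
>     Job_priority =  PriorityWeightAge       * age_factor
>                   + PriorityWeightFairshare * fairshare_factor
>                   + PriorityWeightPartition * partition_factor
>                   + PriorityWeightQOS       * qos_factor
>                   + ...
>
> so if PriorityWeightFairshare is the only non-zero weight, fairshare
> effectively decides the ordering.)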
>
>
> Backfill will let you utilize your resources better, since it allows
> "inserting" low-priority jobs ahead of higher-priority jobs, provided all
> jobs have defined wall times and no inserted job affects in any way the
> start time of a higher-priority job. This allows utilization of the "holes"
> that appear while the scheduler waits for resources to free up in order to
> start some large job.
>
>
> Suppose the system is at 60% utilization of cores, and the next FIFO job
> requires 42% - it will wait until another 2% is freed so it can begin,
> meanwhile not allowing any other job to start, even one that would take
> only 30% of the resources (which are currently free) and would finish
> before that 2% is freed anyway.
>
> Backfill would allow such a job to start, as long as its wall time ensures
> it would finish before the 42% job would have started.
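>
> (For that to work, users have to give the scheduler a wall time to reason
> with - for example, something like
>
>     sbatch --time=01:00:00 --ntasks=8 job.sh
>
> where the script name and numbers are purely illustrative; what matters is
> the explicit --time limit.)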
>
>
> Fair Tree in either case (FIFO or backfill) calculates the priority for
> each job the same way - if the account has used more resources recently
> (the half-life decay factor), it gets a lower priority even though its job
> was submitted earlier than a job from an account that didn't use any
> resources recently.
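>
> (You can inspect the recorded usage and the resulting fairshare factors
> per account/user with, for example,
>
>     sshare -a -l
>
> which lists raw usage, effective usage and the FairShare value that feeds
> into the job priority.)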
>
>
> As can be expected, backfill has to loop over all jobs in the queue in
> order to see whether any job can fit out of order. On very busy/active
> systems that can lead to poor response times unless tuned correctly in
> slurm.conf - look at SchedulerParameters, all parameters starting with
> bf_, and in particular bf_max_job_test=, bf_max_time= and bf_continue
> (but bf_window= can also have some impact if set too high).
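>
> (As a purely illustrative starting point - the right values depend on your
> queue depth and job mix - such tuning might look like:
>
>     SchedulerType=sched/backfill
>     SchedulerParameters=bf_continue,bf_max_job_test=1000,bf_max_time=300,bf_window=2880
>
> in slurm.conf.)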
>
> see the man page at
> https://slurm.schedmd.com/slurm.conf.html#OPT_SchedulerParameters
>
>
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=2
> PriorityUsageResetPeriod=DAILY
> PriorityWeightFairshare=500000
> PriorityFlags=FAIR_TREE
>
> Regards
> Navin.
>
>
>
> On Mon, Apr 27, 2020 at 9:37 PM Daniel Letai <dani at letai.org.il> wrote:
>
>> Are you sure there are enough resources available? The node is in the
>> mixed state and is configured for both partitions - it's possible that
>> earlier, lower-priority jobs are already running and thus blocking the
>> later jobs, especially since it's FIFO.
>>
>>
>> It would really help if you pasted the results of:
>>
>> squeue
>>
>> sinfo
>>
>>
>> As well as the exact sbatch line, so we can see how many resources per
>> node are requested.
>>
>>
>> On 26/04/2020 12:00:06, navin srivastava wrote:
>>
>> Thanks Brian,
>>
>> As suggested, I went through the document, and what I understood is that
>> Fair Tree drives the fairshare mechanism, and jobs should be scheduled
>> based on that.
>>
>> So it means job scheduling will be based on FIFO, but the priority will be
>> decided by fairshare. I am not sure whether the two conflict here. I see
>> that the normal jobs' priority is lower than the GPUsmall priority, so if
>> resources are available in the GPUsmall partition, the jobs should start.
>> No job is pending because of GPU resources - the jobs do not request GPU
>> resources at all.
>>
>> Is there any article where I can see how fairshare works and which
>> settings should not conflict with it?
>> According to the documentation, it never says that FIFO should be disabled
>> when fair-share is applied.
>>
>> Regards
>> Navin.
>>
>>
>>
>>
>>
>> On Sat, Apr 25, 2020 at 12:47 AM Brian W. Johanson <bjohanso at psc.edu>
>> wrote:
>>
>>>
>>> If you haven't looked at the man page for slurm.conf, it will answer
>>> most if not all of your questions:
>>> https://slurm.schedmd.com/slurm.conf.html - but I would depend on the
>>> manual version that was distributed with the version you have installed,
>>> as options do change.
>>>
>>> There is a ton of information that is tedious to get through but reading
>>> through it multiple times opens many doors.
>>>
>>> DefaultTime is listed in there as a Partition option.
>>> If you are scheduling gres/gpu resources, it's quite possible there are
>>> cores available with no corresponding GPUs available.
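>>>
>>> (One quick way to compare free cores and configured GPUs - the format
>>> string is just a sketch - is something like
>>>
>>>     sinfo -N -p GPUsmall -o "%N %G %C"
>>>
>>> where %G shows the configured GRES and %C shows CPUs as
>>> allocated/idle/other/total; 'scontrol show node <nodename>' shows what
>>> is actually allocated on a node.)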
>>>
>>> -b
>>>
>>> On 4/24/20 2:49 PM, navin srivastava wrote:
>>>
>>> Thanks Brian.
>>>
>>> I need to check the job order.
>>>
>>> Is there any way to define a default time limit for a job if the user
>>> does not specify one?
>>>
>>> Also, what is the meaning of fairtree in the priority settings in the
>>> slurm.conf file?
>>>
>>> The sets of nodes are different in the partitions; FIFO does not care
>>> about any partitioning.
>>> Is it strict ordering, meaning the job that came first will go, and until
>>> it runs it will not allow others?
>>>
>>> Also, the priority is high for the GPUsmall partition and low for normal
>>> jobs, and the nodes of the normal partition are full, but GPUsmall cores
>>> are available.
>>>
>>> Regards
>>> Navin
>>>
>>> On Fri, Apr 24, 2020, 23:49 Brian W. Johanson <bjohanso at psc.edu> wrote:
>>>
>>>> Without seeing the jobs in your queue, I would expect the next job in
>>>> FIFO order to be too large to fit within the currently idle resources.
>>>>
>>>> Configure it to use the backfill scheduler:
>>>> SchedulerType=sched/backfill
>>>>
>>>>       SchedulerType
>>>>               Identifies  the type of scheduler to be used.  Note the
>>>> slurmctld daemon must be restarted for a change in scheduler type to become
>>>> effective (reconfiguring a running daemon has no effect for this
>>>> parameter).  The scontrol command can be used to manually change job
>>>> priorities if desired.  Acceptable values include:
>>>>
>>>>               sched/backfill
>>>>                      For a backfill scheduling module to augment the
>>>> default FIFO scheduling.  Backfill scheduling will initiate lower-priority
>>>> jobs if doing so does not delay the expected initiation time of any
>>>> higher  priority  job.   Effectiveness  of  backfill scheduling is
>>>> dependent upon users specifying job time limits, otherwise all jobs will
>>>> have the same time limit and backfilling is impossible.  Note documentation
>>>> for the SchedulerParameters option above.  This is the default
>>>> configuration.
>>>>
>>>>               sched/builtin
>>>>                      This  is  the  FIFO scheduler which initiates jobs
>>>> in priority order.  If any job in the partition can not be scheduled, no
>>>> lower priority job in that partition will be scheduled.  An exception is
>>>> made for jobs that can not run due to partition constraints (e.g. the time
>>>> limit) or down/drained nodes.  In that case, lower priority jobs can be
>>>> initiated and not impact the higher priority job.
>>>>
>>>>
>>>>
>>>> Your partitions are set with MaxTime=INFINITE; if your users are not
>>>> specifying a reasonable time limit for their jobs, this won't help either.
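>>>>
>>>> (For illustration only - the node list and times here are made up - a
>>>> partition definition with a default and a maximum limit might look like:
>>>>
>>>>     PartitionName=GPUsmall Nodes=node[18-19] DefaultTime=04:00:00 MaxTime=7-00:00:00 State=UP
>>>>
>>>> so jobs that omit --time still get a finite limit for backfill to plan
>>>> around.)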
>>>>
>>>>
>>>> -b
>>>>
>>>>
>>>> On 4/24/20 1:52 PM, navin srivastava wrote:
>>>>
>>>> In addition to the above, when I look at the sprio output for both jobs
>>>> it shows:
>>>>
>>>> For the normal queue, all jobs show the same priority:
>>>>
>>>>  JOBID PARTITION   PRIORITY  FAIRSHARE
>>>>         1291352 normal           15789      15789
>>>>
>>>> For GPUsmall, all jobs show the same priority:
>>>>
>>>>  JOBID PARTITION   PRIORITY  FAIRSHARE
>>>>         1291339 GPUsmall      21052      21053
>>>>
>>>> On Fri, Apr 24, 2020 at 11:14 PM navin srivastava <
>>>> navin.altair at gmail.com> wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> We are facing an issue in our environment. Resources are free, but jobs
>>>>> are going into the queued (PD) state instead of running.
>>>>>
>>>>> I have attached the slurm.conf file here.
>>>>>
>>>>> Scenario:
>>>>>
>>>>> There are pending jobs in only two partitions:
>>>>> 344 jobs are in the PD state in the normal partition; the nodes
>>>>> belonging to the normal partition are full, and no more jobs can run
>>>>> there.
>>>>>
>>>>> 1300 jobs in the GPUsmall partition are queued, and enough CPUs are
>>>>> available to execute them, but I see the jobs are not being scheduled
>>>>> on the free nodes.
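>>>>>
>>>>> (The pending reason can be listed, for example, with
>>>>>
>>>>>     squeue -p GPUsmall -t PD -o "%.10i %.9P %.2t %r"
>>>>>
>>>>> where %r is the reason the scheduler gives for keeping a job pending -
>>>>> the command is only shown as a sketch.)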
>>>>>
>>>>> There are no pending jobs in any other partition.
>>>>> For example, the node status of node18:
>>>>>
>>>>> NodeName=node18 Arch=x86_64 CoresPerSocket=18
>>>>>    CPUAlloc=6 CPUErr=0 CPUTot=36 CPULoad=4.07
>>>>>    AvailableFeatures=K2200
>>>>>    ActiveFeatures=K2200
>>>>>    Gres=gpu:2
>>>>>    NodeAddr=node18 NodeHostName=node18 Version=17.11
>>>>>    OS=Linux 4.4.140-94.42-default #1 SMP Tue Jul 17 07:44:50 UTC 2018
>>>>> (0b375e4)
>>>>>    RealMemory=1 AllocMem=0 FreeMem=79532 Sockets=2 Boards=1
>>>>>    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
>>>>> MCS_label=N/A
>>>>>    Partitions=GPUsmall,pm_shared
>>>>>    BootTime=2019-12-10T14:16:37 SlurmdStartTime=2019-12-10T14:24:08
>>>>>    CfgTRES=cpu=36,mem=1M,billing=36
>>>>>    AllocTRES=cpu=6
>>>>>    CapWatts=n/a
>>>>>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>>>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>>>
>>>>> node19:-
>>>>>
>>>>> NodeName=node19 Arch=x86_64 CoresPerSocket=18
>>>>>    CPUAlloc=16 CPUErr=0 CPUTot=36 CPULoad=15.43
>>>>>    AvailableFeatures=K2200
>>>>>    ActiveFeatures=K2200
>>>>>    Gres=gpu:2
>>>>>    NodeAddr=node19 NodeHostName=node19 Version=17.11
>>>>>    OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31 12:25:04 UTC 2018
>>>>> (3090901)
>>>>>    RealMemory=1 AllocMem=0 FreeMem=63998 Sockets=2 Boards=1
>>>>>    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
>>>>> MCS_label=N/A
>>>>>    Partitions=GPUsmall,pm_shared
>>>>>    BootTime=2020-03-12T06:51:54 SlurmdStartTime=2020-03-12T06:53:14
>>>>>    CfgTRES=cpu=36,mem=1M,billing=36
>>>>>    AllocTRES=cpu=16
>>>>>    CapWatts=n/a
>>>>>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>>>>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>>>
>>>>> Could you please help me understand what the reason could be?
>>>>>
>>>>
>>> --
>> Regards,
>>
>> Daniel Letai
>> +972 (0)505 870 456
>>
>> --
> Regards,
>
> Daniel Letai
> +972 (0)505 870 456
>
>

