[slurm-users] not allocating the node for job execution even when resources are available.

navin srivastava navin.altair at gmail.com
Wed Apr 1 06:07:58 UTC 2020


In addition to the above problem: oversubscription is NO, so the documentation
quoted below should apply. Yet in this scenario, even though resources are
available, jobs from the other partition are not being accepted. I even set the
same priority for both partitions, but it didn't help. Any suggestions here?

Slurm Workload Manager - Sharing Consumable Resources:
"Two OverSubscribe=NO partitions assigned the same set of nodes: Jobs from
either partition will be assigned to all available consumable resources. No
consumable resource will be shared. One node could have 2 jobs running on
it, and each job could be from a different partition."
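
To make the documented scenario concrete, the slurm.conf lines it describes
would look roughly like the sketch below (partA and partB are placeholder
names, not my real partitions, which are shown further down):

# Sketch: two OverSubscribe=NO partitions covering the same set of nodes
PartitionName=partA Nodes=Node[17,20] OverSubscribe=NO Priority=8000 Default=NO MaxTime=INFINITE State=UP
PartitionName=partB Nodes=Node[17,20] OverSubscribe=NO Priority=100 Default=NO MaxTime=INFINITE State=UP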

On Tue, Mar 31, 2020 at 4:34 PM navin srivastava <navin.altair at gmail.com>
wrote:

> Hi ,
>
> I have an issue with resource allocation.
>
> In the environment I have partitions like below:
>
> PartitionName=small_jobs Nodes=Node[17,20]  Default=NO MaxTime=INFINITE
> State=UP Shared=YES Priority=8000
> PartitionName=large_jobs Nodes=Node[17,20]  Default=NO MaxTime=INFINITE
> State=UP Shared=YES Priority=100
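>
> For what it is worth, the effective partition settings can be checked with
> something like the following (as far as I know, Shared= is the older
> spelling and is reported as OverSubscribe in the output):
>
> scontrol show partition small_jobs
> scontrol show partition large_jobs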
>
> Also, the node has only a few CPUs allocated and plenty of CPU resources
> still available:
>
> NodeName=Node17 Arch=x86_64 CoresPerSocket=18
>    CPUAlloc=4 CPUErr=0 CPUTot=36 CPULoad=4.09
>    AvailableFeatures=K2200
>    ActiveFeatures=K2200
>    Gres=gpu:2
>    NodeAddr=Node1717 NodeHostName=Node17 Version=17.11
>    OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31 12:25:04 UTC 2018
> (3090901)
>    RealMemory=1 AllocMem=0 FreeMem=225552 Sockets=2 Boards=1
>    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=small_jobs,large_jobs
>    BootTime=2020-03-21T18:56:48 SlurmdStartTime=2020-03-31T09:07:03
>    CfgTRES=cpu=36,mem=1M,billing=36
>    AllocTRES=cpu=4
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
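>
> A quicker way to see the allocated/idle CPU split per node is something
> like the following sketch (sinfo's %C prints CPUs as
> allocated/idle/other/total):
>
> sinfo -N -n Node[17,20] -o "%N %P %C"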
>
> There are no other jobs in the small_jobs partition, but several jobs are
> pending in large_jobs; the resources are available, yet the jobs are not
> going through.
>
> The output for one of the pending jobs is:
>
> scontrol show job 1250258
>    JobId=1250258 JobName=import_workflow
>    UserId=m209767(100468) GroupId=oled(4289) MCS_label=N/A
>    Priority=363157 Nice=0 Account=oledgrp QOS=normal
>    JobState=PENDING Reason=Priority Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>    SubmitTime=2020-03-28T22:00:13 EligibleTime=2020-03-28T22:00:13
>    StartTime=2070-03-19T11:59:09 EndTime=Unknown Deadline=N/A
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    LastSchedEval=2020-03-31T12:58:48
>    Partition=large_jobs AllocNode:Sid=deda1x1466:62260
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=(null)
>    NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=1,node=1
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    Gres=(null) Reservation=(null)
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
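>
> To list the pending jobs in that partition together with their reason and
> priority, I am looking at something along these lines (%Q is the priority,
> %r the pending reason):
>
> squeue -p large_jobs -t PENDING -o "%.10i %.20j %.10Q %.15r"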
>
> This is the scheduling portion of my slurm.conf:
>
>
> SchedulerType=sched/builtin
> #SchedulerParameters=enable_user_top
> SelectType=select/cons_res
> #SelectTypeParameters=CR_Core_Memory
> SelectTypeParameters=CR_Core
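>
> Just to confirm these are the values the controller is actually running
> with, I check them with something like:
>
> scontrol show config | grep -E 'SchedulerType|SelectType'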
>
>
> Any idea why the job is not going for execution when CPU cores are
> available?
>
> Also, I would like to know: if jobs are running on a particular node and I
> restart the slurmd service, in what scenario would the jobs get killed?
> Generally it should not kill the job.
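>
> The restart I have in mind is just the slurmd service restart on the
> compute node, e.g. (assuming a systemd-based installation):
>
> systemctl restart slurmd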
>
> Regards
> Navin.
>
>
>
>
>


More information about the slurm-users mailing list