[slurm-users] not allocating the node for job execution even resources are available.
navin srivastava
navin.altair at gmail.com
Tue Mar 31 11:04:43 UTC 2020
Hi,
I have an issue with resource allocation.
In my environment I have partitions defined like below:
PartitionName=small_jobs Nodes=Node[17,20] Default=NO MaxTime=INFINITE
State=UP Shared=YES Priority=8000
PartitionName=large_jobs Nodes=Node[17,20] Default=NO MaxTime=INFINITE
State=UP Shared=YES Priority=100
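For reference, this is roughly how I check node and CPU availability in these two partitions (the format string is just what I happen to use; %C prints CPUs as allocated/idle/other/total):

sinfo -p small_jobs,large_jobs -N -o "%N %P %C %T"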
Also, the node shows only a few CPUs allocated and plenty of CPU resources still available:
NodeName=Node17 Arch=x86_64 CoresPerSocket=18
CPUAlloc=4 CPUErr=0 CPUTot=36 CPULoad=4.09
AvailableFeatures=K2200
ActiveFeatures=K2200
Gres=gpu:2
NodeAddr=Node1717 NodeHostName=Node17 Version=17.11
OS=Linux 4.12.14-94.41-default #1 SMP Wed Oct 31 12:25:04 UTC 2018 (3090901)
RealMemory=1 AllocMem=0 FreeMem=225552 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=small_jobs,large_jobs
BootTime=2020-03-21T18:56:48 SlurmdStartTime=2020-03-31T09:07:03
CfgTRES=cpu=36,mem=1M,billing=36
AllocTRES=cpu=4
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
There is no other job in the small_jobs partition, but several jobs are pending in the large_jobs partition even though resources are available, and they are not going through.
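This is roughly how I list the pending jobs and their pending reasons in that partition (the format string is only an example; %r is the reason and %Q the priority):

squeue -p large_jobs -t PENDING -o "%i %u %r %Q"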
The output for one of the pending jobs is:
scontrol show job 1250258
JobId=1250258 JobName=import_workflow
UserId=m209767(100468) GroupId=oled(4289) MCS_label=N/A
Priority=363157 Nice=0 Account=oledgrp QOS=normal
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2020-03-28T22:00:13 EligibleTime=2020-03-28T22:00:13
StartTime=2070-03-19T11:59:09 EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2020-03-31T12:58:48
Partition=large_jobs AllocNode:Sid=deda1x1466:62260
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
This is the scheduling-related part of my slurm.conf:
SchedulerType=sched/builtin
#SchedulerParameters=enable_user_top
SelectType=select/cons_res
#SelectTypeParameters=CR_Core_Memory
SelectTypeParameters=CR_Core
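I am also not sure whether the builtin scheduler itself matters here; as a comparison, this is a variant I have been considering but have not tested (the bf_* values are just placeholders on my side):

SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_interval=30
SelectType=select/cons_res
SelectTypeParameters=CR_Core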
Any idea why the job is not going for execution even though CPU cores are available?
Also, I would like to know: if jobs are running on a particular node and I restart the slurmd service, in what scenario would my jobs get killed? Generally it should not kill the jobs.
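For the restart question, this is what I would run on the compute node (Node17 is just the example node from above, and I am assuming slurmd is managed by systemd):

# list running jobs on the node before touching slurmd
squeue -w Node17 -t RUNNING
# restart only the slurmd daemon on this node
sudo systemctl restart slurmd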
Regards
Navin.