[slurm-users] jobs stuck in ReqNodeNotAvail,
Christian Anthon
anthon at rth.dk
Wed Nov 29 08:21:36 MST 2017
Hi,
I have a problem with a newly set up slurm-17.02.7-1.el6.x86_64 installation:
jobs seem to be stuck in ReqNodeNotAvail:
JOBID PARTITION    NAME  USER ST  TIME NODES NODELIST(REASON)
 6982     panic Morgens ferro PD  0:00     1 (ReqNodeNotAvail, UnavailableNodes:)
 6981     panic    SPEC ferro PD  0:00     1 (ReqNodeNotAvail, UnavailableNodes:)
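For completeness, the per-job detail behind that reason can be pulled with
scontrol; a sketch for job 6982 from the queue above (the grep is only there to
trim the output, and the exact field names can vary slightly between versions):

scontrol show job 6982 | grep -E 'JobState|Reason|MinMemory'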
The nodes are fully allocated in terms of memory, but not all CPU
resources are consumed:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
_default     up  infinite    19   mix clone[05-11,25-29,31-32,36-37,39-40,45]
_default     up  infinite    11 alloc alone[02-08,10-13]
fastlane     up  infinite    19   mix clone[05-11,25-29,31-32,36-37,39-40,45]
fastlane     up  infinite    11 alloc alone[02-08,10-13]
panic        up  infinite    19   mix clone[05-11,25-29,31-32,36-37,39-40,45]
panic        up  infinite    12 alloc alone[02-08,10-13,15]
free*        up  infinite    19   mix clone[05-11,25-29,31-32,36-37,39-40,45]
free*        up  infinite    11 alloc alone[02-08,10-13]
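The per-node allocation can be checked along these lines (a sketch; the
AllocMem output field and the node name clone05 are just examples here):

# allocated/idle/other/total CPUs plus configured and allocated memory per node
sinfo -N -p free -O NodeList,CPUsState,Memory,AllocMem
# or drill into a single node
scontrol show node clone05 | grep -E 'CPUAlloc|RealMemory|AllocMem'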
Possibly relevant lines in slurm.conf (full slurm.conf attached):
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
TaskPlugin=task/none
FastSchedule=1
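For context: with SelectType=select/cons_res and CR_CPU_Memory, both CPUs and
memory are consumable resources, so a node whose memory is fully allocated
starts no further jobs even while CPUs sit idle. A minimal batch script that
makes the memory request explicit (contents are illustrative only; when
--mem-per-cpu is omitted, the DefMemPerCPU=1024 from the attached config
applies):

#!/bin/bash
#SBATCH --partition=free
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1024   # MB per allocated CPU; example value
srun ./my_program            # placeholder executable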
Any advice?
Cheers, Christian.
-------------- next part --------------
# Maintained by PUPPET, local edits will be lost
#General
ClusterName=rth
ControlMachine=rnai01
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/linuxproc
ReturnToService=1
MaxJobCount=10000
#prolog/epilog
Prolog=/etc/slurm/scripts/slurm.prolog
Epilog=/etc/slurm/scripts/slurm.epilog
TaskProlog=/etc/slurm/scripts/slurm.task.prolog
TaskEpilog=/etc/slurm/scripts/slurm.task.epilog
SrunProlog=/etc/slurm/scripts/slurm.srun.prolog
SrunEpilog=/etc/slurm/scripts/slurm.srun.epilog
#TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
TaskPlugin=task/none
FastSchedule=1
#Job priority
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityFavorSmall=NO
PriorityMaxAge=14-0
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0 # don't use the qos factor
#LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
AccountingStorageEnforce=limits,qos
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=rnai01
#Job defaults
DefMemPerCPU=1024
#Privacy
PrivateData=accounts,jobs,reservations,usage,users
UsePAM=1
#COMPUTE NODES
TmpFS=/tmp
NodeName=alone[02-08,10-13,15] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64386 TmpDisk=426554 State=UNKNOWN
NodeName=clone[05-11,25-29,31-32,36-37,39-40,45] Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=24019 TmpDisk=446453 State=UNKNOWN
#Partitions
PartitionName=_default Nodes=alone[02-08,10-13],clone[05-11,25-29,31-32,36-37,39-40,45] PriorityJobFactor=2 AllowAccounts=rth AllowGroups=rth ExclusiveUser=YES Default=NO DefaultTime=24:00:00 MaxTime=INFINITE State=UP
PartitionName=fastlane Nodes=alone[02-08,10-13],clone[05-11,25-29,31-32,36-37,39-40,45] PriorityJobFactor=10 AllowAccounts=rth AllowGroups=rth ExclusiveUser=YES Default=NO DefaultTime=24:00:00 MaxTime=INFINITE State=UP
PartitionName=panic Nodes=alone[02-08,10-13,15],clone[05-11,25-29,31-32,36-37,39-40,45] PriorityJobFactor=100 AllowAccounts=rth AllowGroups=rth ExclusiveUser=YES Default=NO DefaultTime=24:00:00 MaxTime=INFINITE State=UP
PartitionName=free Nodes=alone[02-08,10-13],clone[05-11,25-29,31-32,36-37,39-40,45] PriorityJobFactor=1 AllowAccounts=ALL AllowGroups=ALL ExclusiveUser=YES Default=YES DefaultTime=24:00:00 MaxTime=INFINITE State=UP
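A back-of-the-envelope note on the node definitions above, assuming every job
took only the DefMemPerCPU=1024 default: a clone node has 2x4x2 = 16 CPUs
against 24019 MB (16 x 1024 = 16384 MB), and an alone node 2x8x2 = 32 CPUs
against 64386 MB (32 x 1024 = 32768 MB), so default-sized requests alone could
not exhaust memory before the CPUs run out; the running jobs presumably ask
for more than 1024 MB per CPU.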