[slurm-users] sbatch tasks stuck in queue when a job is hung
Robert Kudyba
rkudyba at fordham.edu
Mon Jul 8 18:59:25 UTC 2019
I’m new to Slurm. We have a cluster with three compute nodes plus a head node, running CentOS 7 and Bright Cluster 8.1. Bright support sent me here, as they say Slurm is already configured optimally to allow multiple tasks to run. However, at times a single job will hold up new jobs. Are there any other logs I can look at and/or settings I can change to prevent this, or at least to alert me when it is happening? Here are some tests and commands that I hope will illuminate where I may be going wrong. The slurm.conf file has these options set:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SchedulerTimeSlice=60
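(In case the file and the running controller differ, I assume the effective values can be double-checked with something like:

scontrol show config | grep -E 'SelectType|SchedulerTimeSlice'

though I haven't spotted anything surprising there.)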
I also see /var/log/slurmctld is loaded with errors like these:
[2019-07-03T02:21:30.913] error: _slurm_rpc_node_registration node=node003: Invalid argument
[2019-07-03T02:54:50.655] error: Node node002 has low real_memory size (191879 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2019-07-03T02:54:50.655] error: Node node001 has low real_memory size (191883 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node001: Invalid argument
[2019-07-03T02:54:50.655] error: Node node003 has low real_memory size (191879 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node003: Invalid argument
[2019-07-03T03:28:10.293] error: Node node002 has low real_memory size (191879 < 196489092)
[2019-07-03T03:28:10.293] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2019-07-03T03:28:10.293] error: Node node003 has low real_memory size (191879 < 196489092)
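I have not dug the exact NodeName line out of the Bright-generated slurm.conf yet, but from the numbers above I assume it looks roughly like this (a hypothetical reconstruction, not copied from the file):

NodeName=node[001-003] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 Gres=gpu:1 RealMemory=196489092

Since RealMemory is specified in MB, 196489092 looks more like the nodes' total memory in KB, while slurmd registers the nodes with about 191879 MB, which would explain the "low real_memory" complaint; but I may be misreading it.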
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
352 defq TensorFl myuser PD 0:00 3 (Resources)
scontrol show jobid -dd 352
JobId=352 JobName=TensorFlowGPUTest
UserId=myuser(1001) GroupId=myuser(1001) MCS_label=N/A
Priority=4294901741 Nice=0 Account=(null) QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2019-07-02T16:57:11 EligibleTime=2019-07-02T16:57:11
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-07-02T16:57:59
Partition=defq AllocNode:Sid=ourcluster:386851
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=3-3 NumCPUs=3 NumTasks=3 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=3,node=3
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=gpu:1 Reservation=(null)
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=/home/myuser/cnn_gpu.sh
WorkDir=/home/myuser
StdErr=/home/myuser/slurm-352.out
StdIn=/dev/null
StdOut=/home/myuser/slurm-352.out
Power=
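The batch script is just a small TensorFlow test; I believe its #SBATCH header is roughly the following (reconstructed from the job record above, not copied verbatim from cnn_gpu.sh):

#!/bin/bash
#SBATCH --job-name=TensorFlowGPUTest
#SBATCH --nodes=3
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1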
Another test showed the following:
sinfo -N
NODELIST NODES PARTITION STATE
node001 1 defq* drain
node002 1 defq* drain
node003 1 defq* drain
sinfo -R
REASON USER TIMESTAMP NODELIST
Low RealMemory slurm 2019-05-17T10:05:26 node[001-003]
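My assumption is that once whatever is wrong with the memory setting is corrected, the drain can be cleared with something like:

scontrol update NodeName=node[001-003] State=RESUME

but I wanted to understand the cause before resuming anything.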
[ciscluster]% jobqueue
[ciscluster->jobqueue(slurm)]% ls
Type          Name          Nodes
------------  ------------  ----------------------------
Slurm         defq          node001..node003
Slurm         gpuq
[ourcluster->jobqueue(slurm)]% use defq
[ourcluster->jobqueue(slurm)->defq]% get options
QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP
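In case Bright and Slurm disagree about the partition settings, I can also dump the partition from Slurm's side with:

scontrol show partition defq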
pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
node003: Thread(s) per core: 1
node003: Core(s) per socket: 12
node003: Socket(s): 2
node001: Thread(s) per core: 1
node001: Core(s) per socket: 12
node001: Socket(s): 2
node002: Thread(s) per core: 1
node002: Core(s) per socket: 12
node002: Socket(s): 2
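Since the complaint is about memory rather than CPUs, I assume the more useful cross-check is what slurmd itself computes on each node, e.g.:

pdsh -w node00[1-3] "slurmd -C"

which should print the NodeName/CPUs/RealMemory line each node would try to register with.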
scontrol show nodes node001
NodeName=node001 Arch=x86_64 CoresPerSocket=12
CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:1
NodeAddr=node001 NodeHostName=node001 Version=17.11
OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
RealMemory=196489092 AllocMem=0 FreeMem=184912 Sockets=2 Boards=1
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=defq
BootTime=2019-06-28T15:33:47 SlurmdStartTime=2019-06-28T15:35:17
CfgTRES=cpu=24,mem=196489092M,billing=24
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm at 2019-05-17T10:05:26]
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 3 drain node[001-003]
gpuq up infinite 0 n/a
scontrol show nodes| grep -i mem
RealMemory=196489092 AllocMem=0 FreeMem=184907 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm at 2019-05-17T10:05:26]
RealMemory=196489092 AllocMem=0 FreeMem=185084 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm at 2019-05-17T10:05:26]
RealMemory=196489092 AllocMem=0 FreeMem=188720 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm at 2019-05-17T10:05:26]
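So all three nodes show the same configured RealMemory and the same drain reason. If my units guess above is right, I assume the NodeName line ought to use something at or below the ~191879 MB the nodes actually report, for example:

NodeName=node[001-003] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 Gres=gpu:1 RealMemory=191879

Does that sound right, or is there something else I should be looking at?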