[slurm-users] sbatch tasks stuck in queue when a job is hung
Robert Kudyba
rkudyba at fordham.edu
Mon Jul 8 18:59:25 UTC 2019
I’m new to Slurm. We have a cluster with three compute nodes plus a head node, running CentOS 7 and Bright Cluster 8.1. Bright support sent me here, as they say Slurm is already configured optimally to allow multiple tasks to run. However, at times a single job will hold up new jobs. Are there any other logs I can look at and/or settings I can change to prevent this, or at least to alert me when it is happening? Here are some tests and commands that I hope will illuminate where I may be going wrong. The slurm.conf file has these options set:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SchedulerTimeSlice=60
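(In case the file and the running controller differ, I assume the effective values can be double-checked with something like:

scontrol show config | grep -E 'SelectType|SchedulerTimeSlice'

though I haven't spotted anything surprising there.)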
I also see /var/log/slurmctld is loaded with errors like these:
[2019-07-03T02:21:30.913] error: _slurm_rpc_node_registration node=node003: Invalid argument
[2019-07-03T02:54:50.655] error: Node node002 has low real_memory size (191879 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2019-07-03T02:54:50.655] error: Node node001 has low real_memory size (191883 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node001: Invalid argument
[2019-07-03T02:54:50.655] error: Node node003 has low real_memory size (191879 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node003: Invalid argument
[2019-07-03T03:28:10.293] error: Node node002 has low real_memory size (191879 < 196489092)
[2019-07-03T03:28:10.293] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2019-07-03T03:28:10.293] error: Node node003 has low real_memory size (191879 < 196489092)
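I have not dug the exact NodeName line out of the Bright-generated slurm.conf yet, but from the numbers above I assume it looks roughly like this (a hypothetical reconstruction, not copied from the file):

NodeName=node[001-003] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 Gres=gpu:1 RealMemory=196489092

Since RealMemory is specified in MB, 196489092 looks more like the nodes' total memory in KB, while slurmd registers the nodes with about 191879 MB, which would explain the "low real_memory" complaint; but I may be misreading it.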
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
352 defq TensorFl myuser PD 0:00 3 (Resources)
scontrol show jobid -dd 352
JobId=352 JobName=TensorFlowGPUTest
UserId=myuser(1001) GroupId=myuser(1001) MCS_label=N/A
Priority=4294901741 Nice=0 Account=(null) QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2019-07-02T16:57:11 EligibleTime=2019-07-02T16:57:11
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-07-02T16:57:59
Partition=defq AllocNode:Sid=ourcluster:386851
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=3-3 NumCPUs=3 NumTasks=3 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=3,node=3
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=gpu:1 Reservation=(null)
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=/home/myuser/cnn_gpu.sh
WorkDir=/home/myuser
StdErr=/home/myuser/slurm-352.out
StdIn=/dev/null
StdOut=/home/myuser/slurm-352.out
Power=
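The batch script is just a small TensorFlow test; I believe its #SBATCH header is roughly the following (reconstructed from the job record above, not copied verbatim from cnn_gpu.sh):

#!/bin/bash
#SBATCH --job-name=TensorFlowGPUTest
#SBATCH --nodes=3
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1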
Another test showed the following:
sinfo -N
NODELIST NODES PARTITION STATE
node001 1 defq* drain
node002 1 defq* drain
node003 1 defq* drain
sinfo -R
REASON USER TIMESTAMP NODELIST
Low RealMemory slurm 2019-05-17T10:05:26 node[001-003]
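My assumption is that once whatever is wrong with the memory setting is corrected, the drain can be cleared with something like:

scontrol update NodeName=node[001-003] State=RESUME

but I wanted to understand the cause before resuming anything.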
[ciscluster]% jobqueue
[ciscluster->jobqueue(slurm)]% ls
Type          Name          Nodes
------------  ------------  ----------------------------
Slurm         defq          node001..node003
Slurm         gpuq
[ourcluster->jobqueue(slurm)]% use defq
[ourcluster->jobqueue(slurm)->defq]% get options
QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP
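In case Bright and Slurm disagree about the partition settings, I can also dump the partition from Slurm's side with:

scontrol show partition defq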
pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
node003: Thread(s) per core: 1
node003: Core(s) per socket: 12
node003: Socket(s): 2
node001: Thread(s) per core: 1
node001: Core(s) per socket: 12
node001: Socket(s): 2
node002: Thread(s) per core: 1
node002: Core(s) per socket: 12
node002: Socket(s): 2
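Since the complaint is about memory rather than CPUs, I assume the more useful cross-check is what slurmd itself computes on each node, e.g.:

pdsh -w node00[1-3] "slurmd -C"

which should print the NodeName/CPUs/RealMemory line each node would try to register with.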
scontrol show nodes node001
NodeName=node001 Arch=x86_64 CoresPerSocket=12
CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:1
NodeAddr=node001 NodeHostName=node001 Version=17.11
OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
RealMemory=196489092 AllocMem=0 FreeMem=184912 Sockets=2 Boards=1
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=defq
BootTime=2019-06-28T15:33:47 SlurmdStartTime=2019-06-28T15:35:17
CfgTRES=cpu=24,mem=196489092M,billing=24
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm at 2019-05-17T10:05:26]
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 3 drain node[001-003]
gpuq up infinite 0 n/a
scontrol show nodes| grep -i mem
RealMemory=196489092 AllocMem=0 FreeMem=184907 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm at 2019-05-17T10:05:26]
RealMemory=196489092 AllocMem=0 FreeMem=185084 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm at 2019-05-17T10:05:26]
RealMemory=196489092 AllocMem=0 FreeMem=188720 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm at 2019-05-17T10:05:26]
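So all three nodes show the same configured RealMemory and the same drain reason. If my units guess above is right, I assume the NodeName line ought to use something at or below the ~191879 MB the nodes actually report, for example:

NodeName=node[001-003] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 Gres=gpu:1 RealMemory=191879

Does that sound right, or is there something else I should be looking at?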