[slurm-users] sbatch tasks stuck in queue when a job is hung

Brian Andrus toomuchit at gmail.com
Mon Jul 8 20:02:20 UTC 2019


Your problem here is that the configuration for the nodes in question 
has an incorrect amount of memory set for them. It looks like you have 
it set in kilobytes instead of megabytes.

In your slurm.conf you should look at the RealMemory setting:

*RealMemory*
    Size of real memory on the node in megabytes (e.g. "2048"). The
    default value is 1.
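
A quick way to confirm the correct value (a suggestion on my part, not 
something from your output) is to run slurmd -C on one of the compute 
nodes; it prints a node definition line containing the RealMemory, in 
megabytes, that slurmd itself detects:

    # run on a compute node, e.g. node001; the output shown below is
    # approximate, reconstructed from the figures in your scontrol output
    slurmd -C
    NodeName=node001 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191883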

I would suggest RealMemory=191879, whereas I suspect you currently have 
RealMemory=196489092.
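
For reference, a corrected node definition might look roughly like the 
following. This is only a sketch: I am assuming the node names, GPU 
count, and CPU layout shown in the lscpu and scontrol output from your 
message, so adjust it to match your hardware:

    # in slurm.conf; RealMemory is in megabytes
    NodeName=node[001-003] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191879 Gres=gpu:1

    # after fixing slurm.conf and restarting slurmctld/slurmd, the nodes
    # will likely still be drained with "Low RealMemory"; they can be
    # returned to service with:
    scontrol update NodeName=node[001-003] State=RESUME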

Brian Andrus

On 7/8/2019 11:59 AM, Robert Kudyba wrote:
> I’m new to Slurm and we have a 3-node + head node cluster running 
> CentOS 7 and Bright Cluster 8.1. Their support sent me here, as they 
> say Slurm is configured optimally to allow multiple tasks to run. 
> However, at times a job will hold up new jobs. Are there any other logs 
> I can look at and/or settings to change to prevent this, or to alert me 
> when this is happening? Here are some tests and commands that I hope 
> will illuminate where I may be going wrong. The slurm.conf file has 
> these options set:
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU
> SchedulerTimeSlice=60
>
> I also see /var/log/slurmctld is loaded with errors like these:
> [2019-07-03T02:21:30.913] error: _slurm_rpc_node_registration 
> node=node003: Invalid argument
> [2019-07-03T02:54:50.655] error: Node node002 has low real_memory size 
> (191879 < 196489092)
> [2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration 
> node=node002: Invalid argument
> [2019-07-03T02:54:50.655] error: Node node001 has low real_memory size 
> (191883 < 196489092)
> [2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration 
> node=node001: Invalid argument
> [2019-07-03T02:54:50.655] error: Node node003 has low real_memory size 
> (191879 < 196489092)
> [2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration 
> node=node003: Invalid argument
> [2019-07-03T03:28:10.293] error: Node node002 has low real_memory size 
> (191879 < 196489092)
> [2019-07-03T03:28:10.293] error: _slurm_rpc_node_registration 
> node=node002: Invalid argument
> [2019-07-03T03:28:10.293] error: Node node003 has low real_memory size 
> (191879 < 196489092)
>
> squeue
> JOBID PARTITION NAME  USER  ST TIME NODES NODELIST(REASON)
> 352   defq TensorFl myuser PD 0:00 3 (Resources)
>
>  scontrol show jobid -dd 352
> JobId=352 JobName=TensorFlowGPUTest
> UserId=myuser(1001) GroupId=myuser(1001) MCS_label=N/A
> Priority=4294901741 Nice=0 Account=(null) QOS=normal
> JobState=PENDING Reason=Resources Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
> SubmitTime=2019-07-02T16:57:11 EligibleTime=2019-07-02T16:57:11
> StartTime=Unknown EndTime=Unknown Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-07-02T16:57:59
> Partition=defq AllocNode:Sid=ourcluster:386851
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=(null)
> NumNodes=3-3 NumCPUs=3 NumTasks=3 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=3,node=3
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> Gres=gpu:1 Reservation=(null)
> OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
> Command=/home/myuser/cnn_gpu.sh
> WorkDir=/home/myuser
> StdErr=/home/myuser/slurm-352.out
> StdIn=/dev/null
> StdOut=/home/myuser/slurm-352.out
> Power=
>
> Another test showed the below:
> sinfo -N
> NODELIST   NODES PARTITION STATE
> node001        1     defq*    drain
> node002        1     defq*    drain
> node003        1     defq*    drain
>
> sinfo -R
> REASON               USER      TIMESTAMP           NODELIST
> Low RealMemory       slurm     2019-05-17T10:05:26 node[001-003]
>
>
> [ciscluster]% jobqueue
> [ciscluster->jobqueue(slurm)]% ls
> Type         Name         Nodes
> ------------ ------------ ------------------------------
> Slurm        defq         node001..node003
> Slurm        gpuq
> [ourcluster->jobqueue(slurm)]% use defq
> [ourcluster->jobqueue(slurm)->defq]% get options
> QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP
>
> pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
> node003: Thread(s) per core: 1
> node003: Core(s) per socket: 12
> node003: Socket(s): 2
> node001: Thread(s) per core: 1
> node001: Core(s) per socket: 12
> node001: Socket(s): 2
> node002: Thread(s) per core: 1
> node002: Core(s) per socket: 12
> node002: Socket(s): 2
>
> scontrol show nodes node001
> NodeName=node001 Arch=x86_64 CoresPerSocket=12
> CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=gpu:1
> NodeAddr=node001 NodeHostName=node001 Version=17.11
> OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
> RealMemory=196489092 AllocMem=0 FreeMem=184912 Sockets=2 Boards=1
> State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A 
> MCS_label=N/A
> Partitions=defq
> BootTime=2019-06-28T15:33:47 SlurmdStartTime=2019-06-28T15:35:17
> CfgTRES=cpu=24,mem=196489092M,billing=24
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Low RealMemory [slurm at 2019-05-17T10:05:26]
>
>
> sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> defq* up infinite 3 drain node[001-003]
> gpuq up infinite 0 n/a
>
>
> scontrol show nodes| grep -i mem
> RealMemory=196489092 AllocMem=0 FreeMem=184907 Sockets=2 Boards=1
> CfgTRES=cpu=24,mem=196489092M,billing=24
> Reason=Low RealMemory [slurm at 2019-05-17T10:05:26]
> RealMemory=196489092 AllocMem=0 FreeMem=185084 Sockets=2 Boards=1
> CfgTRES=cpu=24,mem=196489092M,billing=24
> Reason=Low RealMemory [slurm at 2019-05-17T10:05:26]
> RealMemory=196489092 AllocMem=0 FreeMem=188720 Sockets=2 Boards=1
> CfgTRES=cpu=24,mem=196489092M,billing=24
> Reason=Low RealMemory [slurm at 2019-05-17T10:05:26]