[slurm-users] sbatch tasks stuck in queue when a job is hung

Robert Kudyba rkudyba at fordham.edu
Mon Jul 8 20:48:49 UTC 2019


Thanks Brian, indeed we did have it set in bytes. I set it to the MB value. Hoping this takes care of the situation.
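
In case it helps anyone else, a minimal sketch of the follow-up steps (assuming the node names from this thread, and that Bright has already pushed the corrected slurm.conf out to the nodes):

# have slurmctld and the slurmd daemons re-read slurm.conf
scontrol reconfigure

# once the nodes re-register cleanly, clear the "Low RealMemory" drain
scontrol update NodeName=node[001-003] State=RESUME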

> On Jul 8, 2019, at 4:02 PM, Brian Andrus <toomuchit at gmail.com> wrote:
> 
> Your problem here is that the configuration for the nodes in question has an incorrect amount of memory set for them. It looks like you have it set in bytes instead of megabytes.
> 
> In your slurm.conf you should look at the RealMemory setting:
> 
> 
> RealMemory
> Size of real memory on the node in megabytes (e.g. "2048"). The default value is 1. 
> 
> I would suggest RealMemory=191879, where I suspect you have RealMemory=196489092.
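> 
> As a minimal sketch (the socket/core counts and the gpu Gres are taken from the output further down in your message; treat the exact line as an assumption about your setup rather than a drop-in replacement), the node definition would look something like:
> 
> NodeName=node[001-003] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191879 Gres=gpu:1 State=UNKNOWN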
> 
> Brian Andrus
> On 7/8/2019 11:59 AM, Robert Kudyba wrote:
>> I’m new to Slurm and we have a 3 node + head node cluster running CentOS 7 and Bright Cluster 8.1. Their support sent me here, as they say Slurm is configured optimally to allow multiple tasks to run. However, at times a job will hold up new jobs. Are there any other logs I can look at and/or settings to change to prevent this, or to alert me when this is happening? Here are some tests and commands that I hope will illuminate where I may be going wrong. The slurm.conf file has these options set:
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU
>> SchedulerTimeSlice=60
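>> 
>> As a side note, a quick sketch of how to confirm what the running controller actually loaded (as opposed to what is in the file):
>> scontrol show config | grep -iE 'selecttype|schedulertimeslice'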
>> 
>> I also see /var/log/slurmctld is loaded with errors like these:
>> [2019-07-03T02:21:30.913] error: _slurm_rpc_node_registration node=node003: Invalid argument
>> [2019-07-03T02:54:50.655] error: Node node002 has low real_memory size (191879 < 196489092)
>> [2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node002: Invalid argument
>> [2019-07-03T02:54:50.655] error: Node node001 has low real_memory size (191883 < 196489092)
>> [2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node001: Invalid argument
>> [2019-07-03T02:54:50.655] error: Node node003 has low real_memory size (191879 < 196489092)
>> [2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node003: Invalid argument
>> [2019-07-03T03:28:10.293] error: Node node002 has low real_memory size (191879 < 196489092)
>> [2019-07-03T03:28:10.293] error: _slurm_rpc_node_registration node=node002: Invalid argument
>> [2019-07-03T03:28:10.293] error: Node node003 has low real_memory size (191879 < 196489092)
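>> 
>> (A sketch of how the mismatch in those messages can be cross-checked on a compute node: running "slurmd -C" prints the hardware that slurmd detects, including RealMemory in megabytes, which is the unit slurm.conf expects. On one of these nodes the output would look roughly like
>> NodeName=node001 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191883
>> with the exact numbers here taken from the log lines above rather than captured live.)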
>> 
>> squeue
>> JOBID PARTITION     NAME   USER ST TIME NODES NODELIST(REASON)
>>   352      defq TensorFl myuser PD 0:00     3 (Resources)
>> 
>>  scontrol show jobid -dd 352
>> JobId=352 JobName=TensorFlowGPUTest
>> UserId=myuser(1001) GroupId=myuser(1001) MCS_label=N/A
>> Priority=4294901741 Nice=0 Account=(null) QOS=normal
>> JobState=PENDING Reason=Resources Dependency=(null)
>> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>> DerivedExitCode=0:0
>> RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>> SubmitTime=2019-07-02T16:57:11 EligibleTime=2019-07-02T16:57:11
>> StartTime=Unknown EndTime=Unknown Deadline=N/A
>> PreemptTime=None SuspendTime=None SecsPreSuspend=0
>> LastSchedEval=2019-07-02T16:57:59
>> Partition=defq AllocNode:Sid=ourcluster:386851
>> ReqNodeList=(null) ExcNodeList=(null)
>> NodeList=(null)
>> NumNodes=3-3 NumCPUs=3 NumTasks=3 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>> TRES=cpu=3,node=3
>> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>> MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>> Features=(null) DelayBoot=00:00:00
>> Gres=gpu:1 Reservation=(null)
>> OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
>> Command=/home/myuser/cnn_gpu.sh
>> WorkDir=/home/myuser
>> StdErr=/home/myuser/slurm-352.out
>> StdIn=/dev/null
>> StdOut=/home/myuser/slurm-352.out
>> Power=
>> 
>> Another test showed the following:
>> sinfo -N
>> NODELIST   NODES PARTITION STATE
>> node001        1     defq*    drain
>> node002        1     defq*    drain
>> node003        1     defq*    drain
>> 
>> sinfo -R
>> REASON               USER      TIMESTAMP           NODELIST
>> Low RealMemory       slurm     2019-05-17T10:05:26 node[001-003]
>> 
>> 
>> [ciscluster]% jobqueue
>> [ciscluster->jobqueue(slurm)]% ls
>> Type   Name   Nodes
>> ------ ------ ----------------------------------------------------
>> Slurm  defq   node001..node003
>> Slurm  gpuq
>> [ourcluster->jobqueue(slurm)]% use defq
>> [ourcluster->jobqueue(slurm)->defq]% get options
>> QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP
>> 
>> pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'" 
>> node003: Thread(s) per core: 1 
>> node003: Core(s) per socket: 12 
>> node003: Socket(s): 2 
>> node001: Thread(s) per core: 1 
>> node001: Core(s) per socket: 12 
>> node001: Socket(s): 2 
>> node002: Thread(s) per core: 1 
>> node002: Core(s) per socket: 12 
>> node002: Socket(s): 2 
>> 
>> scontrol show nodes node001 
>> NodeName=node001 Arch=x86_64 CoresPerSocket=12 
>> CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 
>> AvailableFeatures=(null) 
>> ActiveFeatures=(null)
>> Gres=gpu:1 
>> NodeAddr=node001 NodeHostName=node001 Version=17.11 
>> OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018 
>> RealMemory=196489092 AllocMem=0 FreeMem=184912 Sockets=2 Boards=1 
>> State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A 
>> Partitions=defq 
>> BootTime=2019-06-28T15:33:47 SlurmdStartTime=2019-06-28T15:35:17 
>> CfgTRES=cpu=24,mem=196489092M,billing=24 
>> AllocTRES= 
>> CapWatts=n/a 
>> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 
>> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s 
>> Reason=Low RealMemory [slurm at 2019-05-17T10:05:26] 
>> 
>> 
>> sinfo 
>> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
>> defq* up infinite 3 drain node[001-003] 
>> gpuq up infinite 0 n/a 
>> 
>> 
>> scontrol show nodes| grep -i mem 
>> RealMemory=196489092 AllocMem=0 FreeMem=184907 Sockets=2 Boards=1 
>> CfgTRES=cpu=24,mem=196489092M,billing=24 
>> Reason=Low RealMemory [slurm at 2019-05-17T10:05:26] 
>> RealMemory=196489092 AllocMem=0 FreeMem=185084 Sockets=2 Boards=1 
>> CfgTRES=cpu=24,mem=196489092M,billing=24 
>> Reason=Low RealMemory [slurm at 2019-05-17T10:05:26] 
>> RealMemory=196489092 AllocMem=0 FreeMem=188720 Sockets=2 Boards=1 
>> CfgTRES=cpu=24,mem=196489092M,billing=24 
>> Reason=Low RealMemory [slurm at 2019-05-17T10:05:26] 
