[slurm-users] Socket timed out on send/recv operation
Kirk Main
kjmain at ncsu.edu
Thu Oct 18 11:58:37 MDT 2018
Hi all,
I'm new to administering Slurm and I've just gotten my new cluster up and
running. We started getting a lot of "Socket timed out on send/recv
operation" errors when submitting jobs, and also when running "squeue"
while others are submitting jobs. The jobs do eventually run after about a
minute, but the entire system feels very sluggish, and obviously this isn't
normal. Not sure what's going on here...
Head nodes ma-vm-slurm01 and ma-vm-slurm02 are virtual machines running on a
Hyper-V host, with a common NFS share between all of the worker and head
nodes. The head nodes have 8 CPUs / 8 GB RAM and run Ubuntu 16.04. All
network interconnect is 10GbE.
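For what it's worth, the slowdown shows up even for trivial commands, so here is
a rough sketch of how I've been reproducing it from a login node (nothing fancy,
just timing the RPCs against slurmctld and looking at its stats with sdiag):

time squeue
time sbatch --wrap "hostname"
sdiag    # check server thread count and agent queue size while the above hang

Both timed commands sit for roughly 30-60 seconds whenever someone else is
submitting jobs.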
slurmctld Log Snippet:
Oct 15 11:08:57 ma-vm-slurm01 slurmctld[1603]: validate_node_specs: Node ma-pm-hpc03 unexpectedly rebooted boot_time=1539616111 last response=1539615694
Oct 15 12:40:21 ma-vm-slurm01 slurmctld[1603]: _slurm_rpc_submit_batch_job: JobId=476 InitPrio=4294901364 usec=33061624
Oct 15 12:40:36 ma-vm-slurm01 slurmctld[1603]: email msg to mrgaddy at ncsu.edu: Slurm Job_id=476 Name=StochUnifOpt_Chu_3D_forCluster.m.job Began, Queued time 00:00:48
Oct 15 12:40:36 ma-vm-slurm01 slurmctld[1603]: sched: Allocate JobId=476 NodeList=ma-pm-hpc17 #CPUs=8 Partition=math-hpc
Oct 15 12:40:45 ma-vm-slurm01 slurmctld[16956]: error: Failed to exec /usr/bin/sendmail: No such file or directory
Oct 15 13:12:00 ma-vm-slurm01 slurmctld[1603]: _slurm_rpc_submit_batch_job: JobId=477 InitPrio=4294901363 usec=75836582
Oct 15 13:12:23 ma-vm-slurm01 slurmctld[1603]: email msg to mrgaddy at ncsu.edu: Slurm Job_id=477 Name=StochUnifOpt_Chu_3D_forCluster.m.job Began, Queued time 00:01:39
Oct 15 13:12:23 ma-vm-slurm01 slurmctld[1603]: sched: Allocate JobId=477 NodeList=ma-pm-hpc17 #CPUs=8 Partition=math-hpc
Oct 15 13:12:34 ma-vm-slurm01 slurmctld[18952]: error: Failed to exec /usr/bin/sendmail: No such file or directory
Oct 15 13:13:58 ma-vm-slurm01 slurmctld[1603]: sched: _slurm_rpc_allocate_resources JobId=478 NodeList=ma-pm-hpc17 usec=12600182
Oct 15 13:13:58 ma-vm-slurm01 slurmctld[1603]: error: Job allocate response msg send failure, killing JobId=478
Oct 15 13:13:58 ma-vm-slurm01 slurmctld[1603]: _job_complete: JobId=478 WTERMSIG 15
Oct 15 13:13:58 ma-vm-slurm01 slurmctld[1603]: _job_complete: JobId=478 done
Oct 15 13:36:58 ma-vm-slurm01 slurmctld[1603]: _job_complete: JobId=476 WEXITSTATUS 0
Oct 15 13:36:58 ma-vm-slurm01 slurmctld[1603]: email msg to mrgaddy at ncsu.edu: Slurm Job_id=476 Name=StochUnifOpt_Chu_3D_forCluster.m.job Ended, Run time 00:56:22, COMPLETED, ExitCode 0
Oct 15 13:36:58 ma-vm-slurm01 slurmctld[1603]: _job_complete: JobId=476 done
Oct 15 13:37:03 ma-vm-slurm01 slurmctld[19285]: error: Failed to exec /usr/bin/sendmail: No such file or directory
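(The sendmail errors are presumably a separate problem: MailProg points at
/usr/bin/sendmail, but no MTA is installed on the head nodes. I'm assuming
something like the following would clear those lines, although it obviously
won't explain the timeouts:

sudo apt-get install -y mailutils   # mailutils should provide /usr/bin/mail on Ubuntu 16.04
# and in slurm.conf:
MailProg=/usr/bin/mail

followed by an "scontrol reconfigure" or a slurmctld restart.)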
slurm.conf:
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=math-hpc
ControlMachine=ma-vm-slurm01
#ControlAddr=
BackupController=ma-vm-slurm02
#BackupAddr=
#
SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
MailProg=/usr/bin/sendmail
StateSaveLocation=/mnt/HpcStor/etc/slurm/state
SlurmdSpoolDir=/var/spool/slurmd.spool
SwitchType=switch/none
MpiDefault=none
MpiParams=ports=12000-12999
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/linuxproc
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=1
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=/etc/slurm/slurm.epilog.clean
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/cgroup
TaskPluginParam=Cores,Verbose
#TrackWCKey=no
#TreeWidth=50
TmpFS=/tmp
#UsePAM=
#
# TIMERS
SlurmctldTimeout=120
SlurmdTimeout=300
InactiveLimit=600
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/builtin
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_MEMORY
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=5
#SlurmctldLogFile=
SlurmdDebug=5
#SlurmdLogFile=
JobCompType=jobcomp/SlurmDBD
#JobCompLoc=
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=ma-vm-slurm01
AccountingStorageBackupHost=ma-vm-slurm02
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
AccountingStorageEnforce=associations,limits
#
# COMPUTE NODES
#GresTypes=gpu
NodeName=ma-pm-hpc[01-10] RealMemory=128000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
NodeName=ma-pm-hpc[11,12] RealMemory=128000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 State=UNKNOWN
NodeName=ma-pm-hpc[13-23] RealMemory=192000 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 State=UNKNOWN
PartitionName=math-hpc Nodes=ma-pm-hpc[01-23] Default=YES MaxTime=10-00:00:00 State=UP Shared=FORCE DefMemPerCPU=7680
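A couple of things I'm wondering about, in case they're relevant (these are
just guesses on my part, not things I've tested yet): StateSaveLocation sits on
the shared NFS mount, and the scheduler is the builtin one that runs on every
submission. Would something along these lines be a sensible thing to try?

# faster storage for controller state (path here is just a placeholder;
# I assume the backup controller still needs to see the same directory,
# so it can't simply be local disk)
StateSaveLocation=/mnt/FastStor/slurm/state
# more headroom than the 10-second default before clients report
# "Socket timed out on send/recv operation"
MessageTimeout=30
# batch up scheduling instead of attempting it on every submit RPC
SchedulerType=sched/backfill
SchedulerParameters=defer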
Thanks,
*Kirk J. Main*
Systems Administrator, Department of Mathematics
College of Sciences
P: 919.515.6315
kjmain at ncsu.edu