[slurm-users] timeouts submitting sbatch jobs

Altemara, Anthony Anthony.Altemara at q2labsolutions.com
Fri Nov 17 16:43:45 UTC 2023


Hey SLURM user group,

We're seeing intermittent timeouts when contacting the controller (slurmctld) with sbatch in a non-interactive / non-blocking fashion, even with MessageTimeout increased to 60s. Our slurmctld does not have slurmdbd configured, CPU load and memory usage on the controller are barely registering, and we're not seeing any storage I/O latency there. This particular cluster has fewer than 50 nodes, and we're submitting no more than tens of jobs per second.
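
In case it's useful for reproducing or quantifying this elsewhere, a minimal timing loop along these lines (the wrapped command and partition name are just placeholders, not our real workload) is what we'd run from a submission node:

for i in $(seq 1 100); do
    # time a trivial batch submission; anything approaching MessageTimeout (60s here) is a stall
    /usr/bin/time -f "attempt $i: %e s" \
        sbatch --wrap="true" --partition=debug --output=/dev/null
    sleep 1
done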

Any ideas on how we can troubleshoot this issue?
We have done packet captures on both the controller and the submission nodes, and see no obvious issues with the network.
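
On the controller side, a rough sketch of the checks we'd run next: sdiag to look at the server thread count, agent queue size, and per-RPC counts/average times, plus a temporary log-level bump while reproducing the problem (the log path below is the redacted one from our config), then restoring the configured level:

# sdiag
# scontrol setdebug debug2
# tail -f <<REDACTED>>/slurmctld.log
# scontrol setdebug info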

What do you have your MessageTimeout set to?
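
For comparison, ours are currently set along these lines in slurm.conf (the same values show up in the scontrol dump below):

MessageTimeout=60
TCPTimeout=60

and can be confirmed with:

# scontrol show config | grep -Ei 'messagetimeout|tcptimeout'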

# sinfo --version
slurm 18.08.9

# scontrol show config

Configuration data as of 2023-11-17T10:14:25

AccountingStorageBackupHost = (null)

AccountingStorageEnforce = none

AccountingStorageHost   = localhost

AccountingStorageLoc    = <<REDACTED>>/accounting.log

AccountingStoragePort   = 0

AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages

AccountingStorageType   = accounting_storage/filetxt

AccountingStorageUser   = root

AccountingStoreJobComment = Yes

AcctGatherEnergyType    = acct_gather_energy/none

AcctGatherFilesystemType = acct_gather_filesystem/none

AcctGatherInterconnectType = acct_gather_interconnect/none

AcctGatherNodeFreq      = 10 sec

AcctGatherProfileType   = acct_gather_profile/none

AllowSpecResourcesUsage = 0

AuthInfo                = (null)

AuthType                = auth/none

BatchStartTimeout       = 240 sec

BOOT_TIME               = 2023-11-15T14:24:09

BurstBufferType         = (null)

CheckpointType          = checkpoint/none

ClusterName             = <<REDACTED>>/

CommunicationParameters = (null)

CompleteWait            = 0 sec

CoreSpecPlugin          = core_spec/none

CpuFreqDef              = Unknown

CpuFreqGovernors        = Performance,OnDemand,UserSpace

CryptoType              = crypto/openssl

DebugFlags              = (null)

DefMemPerCPU            = 1024

DisableRootJobs         = No

EioTimeout              = 60

EnforcePartLimits       = NO

Epilog                  = (null)

EpilogMsgTime           = 2000 usec

EpilogSlurmctld         = (null)

ExtSensorsType          = ext_sensors/none

ExtSensorsFreq          = 0 sec

FairShareDampeningFactor = 1

FastSchedule            = 1

FederationParameters    = (null)

FirstJobId              = 1

GetEnvTimeout           = 2 sec

GresTypes               = (null)

GroupUpdateForce        = 1

GroupUpdateTime         = 600 sec

HASH_VAL                = Match

HealthCheckInterval     = 60 sec

HealthCheckNodeState    = ANY

HealthCheckProgram      = /bin/ls

InactiveLimit           = 3600 sec

JobAcctGatherFrequency  = task=5

JobAcctGatherType       = jobacct_gather/none

JobAcctGatherParams     = NoOverMemoryKill

JobCheckpointDir        = <<REDACTED>>/checkpoint

JobCompHost             = localhost

JobCompLoc              = <<REDACTED>>/job_completions.log

JobCompPort             = 0

JobCompType             = jobcomp/filetxt

JobCompUser             = root

JobContainerType        = job_container/none

JobCredentialPrivateKey = /etc/slurm/pki/slurm.key

JobCredentialPublicCertificate = /etc/slurm/pki/slurm.crt

JobDefaults             = (null)

JobFileAppend           = 1

JobRequeue              = 1

JobSubmitPlugins        = (null)

KeepAliveTime           = SYSTEM_DEFAULT

KillOnBadExit           = 1

KillWait                = 60 sec

LaunchParameters        = slurmstepd_memlock

LaunchType              = launch/slurm

Layouts                 =

Licenses                = <<REDACTED>>

LicensesUsed            = <<REDACTED>>

LogTimeFormat           = iso8601_ms

MailDomain              = <<REDACTED>>

MailProg                = /bin/mail

MaxArraySize            = 1001

MaxJobCount             = 10000

MaxJobId                = 67043328

MaxMemPerNode           = UNLIMITED

MaxStepCount            = 40000

MaxTasksPerNode         = 512

MCSPlugin               = mcs/none

MCSParameters           = (null)

MemLimitEnforce         = No

MessageTimeout          = 60 sec

MinJobAge               = 600 sec

MpiDefault              = none

MpiParams               = (null)

MsgAggregationParams    = (null)

NEXT_JOB_ID             = 7262059

NodeFeaturesPlugins     = (null)

OverTimeLimit           = 10 min

PluginDir               = <<REDACTED>>/slurm

PlugStackConfig         = /etc/slurm/plugstack.conf

PowerParameters         = (null)

PowerPlugin             =

PreemptMode             = OFF

PreemptType             = preempt/none

PriorityParameters      = (null)

PriorityDecayHalfLife   = 7-00:00:00

PriorityCalcPeriod      = 00:05:00

PriorityFavorSmall      = Yes

PriorityFlags           =

PriorityMaxAge          = 1-00:00:00

PriorityUsageResetPeriod = NONE

PriorityType            = priority/multifactor

PriorityWeightAge       = 1000

PriorityWeightFairShare = 0

PriorityWeightJobSize   = 500

PriorityWeightPartition = 1000

PriorityWeightQOS       = 0

PriorityWeightTRES      = (null)

PrivateData             = none

ProctrackType           = proctrack/linuxproc

Prolog                  = (null)

PrologEpilogTimeout     = 2700

PrologSlurmctld         = (null)

PrologFlags             = (null)

PropagatePrioProcess    = 0

PropagateResourceLimits = NONE

PropagateResourceLimitsExcept = (null)

RebootProgram           = (null)

ReconfigFlags           = (null)

RequeueExit             = (null)

RequeueExitHold         = (null)

ResumeFailProgram       = (null)

ResumeProgram           = (null)

ResumeRate              = 300 nodes/min

ResumeTimeout           = 60 sec

ResvEpilog              = (null)

ResvOverRun             = 0 min

ResvProlog              = (null)

ReturnToService         = 2

RoutePlugin             = route/default

SallocDefaultCommand    = (null)

SbcastParameters        = (null)

SchedulerParameters     = nohold_on_prolog_fail

SchedulerTimeSlice      = 30 sec

SchedulerType           = sched/backfill

SelectType              = select/cons_res

SelectTypeParameters    = CR_CPU_MEMORY,CR_LLN

SlurmUser               = slurm(64030)

SlurmctldAddr           = (null)

SlurmctldDebug          = info

SlurmctldHost[0]        = <<REDACTED>>

SlurmctldLogFile        = <<REDACTED>>/slurmctld.log

SlurmctldPort           = 6817

SlurmctldSyslogDebug    = debug2

SlurmctldPrimaryOffProg = (null)

SlurmctldPrimaryOnProg  = (null)

SlurmctldTimeout        = 240 sec

SlurmctldParameters     = (null)

SlurmdDebug             = info

SlurmdLogFile           = /var/log/slurm/slurmd.log

SlurmdParameters        = (null)

SlurmdPidFile           = /run/slurmd.pid

SlurmdPort              = 6818

SlurmdSpoolDir          = /var/lib/slurm/slurmd

SlurmdSyslogDebug       = error

SlurmdTimeout           = 300 sec

SlurmdUser              = root(0)

SlurmSchedLogFile       = <<REDACTED>>/slurmsched.log

SlurmSchedLogLevel      = 3

SlurmctldPidFile        = /run/slurmctld.pid

SlurmctldPlugstack      = (null)

SLURM_CONF              = /etc/slurm/slurm.conf

SLURM_VERSION           = 18.08.9

SrunEpilog              = (null)

SrunPortRange           = 0-0

SrunProlog              = (null)

StateSaveLocation       = <<REDACTED>>/slurmctld

SuspendExcNodes         = (null)

SuspendExcParts         = (null)

SuspendProgram          = (null)

SuspendRate             = 60 nodes/min

SuspendTime             = NONE

SuspendTimeout          = 30 sec

SwitchType              = switch/none

TaskEpilog              = (null)

TaskPlugin              = task/none

TaskPluginParam         = (null type)

TaskProlog              = (null)

TCPTimeout              = 60 sec

TmpFS                   = /tmp

TopologyParam           = (null)

TopologyPlugin          = topology/none

TrackWCKey              = No

TreeWidth               = 50

UsePam                  = 0

UnkillableStepProgram   = (null)

UnkillableStepTimeout   = 240 sec

VSizeFactor             = 0 percent

WaitTime                = 300 sec

X11Parameters           = (null)



Slurmctld(primary) at <<REDACTED>> is UP


Anthony Altemara
IT Infrastructure Associate Director
Office: +1 919.491.2220
Anthony.Altemara at Q2LabSolutions.com  |  www.Q2LabSolutions.com


