[slurm-users] Mysterious job terminations on Slurm 17.11.10
Andy Riebs
andy.riebs at hpe.com
Tue Feb 5 14:47:03 UTC 2019
Excellent suggestions, Chris, but they didn't pan out (and I had such
high hopes for them!).
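For the record, dmesg and syslog on the compute nodes showed no sign of
the OOM killer at the times in question. Roughly speaking, this is the
sort of check I ran (pdsh and the node list here are placeholders for
however you'd loop over your own nodes):

# Look for OOM-killer activity around the cancellation times
pdsh -w node[001-032] "dmesg -T | grep -i 'out of memory'"
pdsh -w node[001-032] "grep -i 'oom-killer' /var/log/messages"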
As for slurm.conf, here's the output from "scontrol show config" (I've
included the original problem report below):
Configuration data as of 2019-02-05T14:29:11
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost = slurmdb
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/rapl
AcctGatherFilesystemType = acct_gather_filesystem/lustre
AcctGatherInterconnectType = acct_gather_interconnect/ofed
AcctGatherNodeFreq = 30 sec
AcctGatherProfileType = acct_gather_profile/hdf5
AllowSpecResourcesUsage = 0
AuthInfo = (null)
AuthType = auth/munge
BackupAddr = (null)
BackupController = (null)
BatchStartTimeout = 10 sec
BOOT_TIME = 2019-02-04T21:05:46
BurstBufferType = (null)
CheckpointType = checkpoint/none
ChosLoc = (null)
ClusterName = cluster
CompleteWait = 32 sec
ControlAddr = slurm
ControlMachine = slurm1,slurm2
CoreSpecPlugin = core_spec/none
CpuFreqDef = Unknown
CpuFreqGovernors = Performance,OnDemand
CryptoType = crypto/munge
DebugFlags = Energy,NO_CONF_HASH,Profile
DefMemPerNode = UNLIMITED
DisableRootJobs = No
EioTimeout = 60
EnforcePartLimits = ANY
Epilog = /opt/slurm/scripts/epilog.sh
EpilogMsgTime = 4000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FastSchedule = 1
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = (null)
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 7200 sec
HealthCheckNodeState = IDLE
HealthCheckProgram = /opt/nhc/1.4.2/slurm_scripts/nhc_idle_check.sh
InactiveLimit = 120 sec
JobAcctGatherFrequency = Task=30,Energy=30,Network=30,Filesystem=30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm/job_completions
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = (null)
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 1
KillWait = 30 sec
LaunchParameters = send_gids
LaunchType = launch/slurm
Layouts =
Licenses = (null)
LicensesUsed = (null)
LogTimeFormat = iso8601_ms
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 1001
MaxJobCount = 10000
MaxJobId = 67043328
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MemLimitEnforce = Yes
MessageTimeout = 60 sec
MinJobAge = 2 sec
MpiDefault = pmix
MpiParams = (null)
MsgAggregationParams = (null)
NEXT_JOB_ID = 11559
NodeFeaturesPlugins = (null)
OverTimeLimit = 0 min
PluginDir = /opt/slurm/lib/slurm
PlugStackConfig = /opt/slurm/etc/plugstack.conf
PowerParameters = (null)
PowerPlugin =
PreemptMode = OFF
PreemptType = preempt/none
PriorityParameters = (null)
PriorityType = priority/basic
PrivateData = none
ProctrackType = proctrack/linuxproc
Prolog = /opt/slurm/scripts/prolog.sh
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = (null)
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 0
RoutePlugin = route/default
SallocDefaultCommand = (null)
SbcastParameters = (null)
SchedulerParameters     = default_queue_depth=1000,sched_interval=6,ff_wait=2,ff_wait_value=5
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CPU
SlurmUser = slurm(1512)
SlurmctldDebug = debug2
SlurmctldLogFile = /var/log/slurm/slurmctld.log
SlurmctldPort = 6809-6816
SlurmctldSyslogDebug = quiet
SlurmctldTimeout = 300 sec
SlurmdDebug = debug5
SlurmdLogFile = /var/log/slurm/slurm%h.log
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /var/log/slurm/slurmd.spool
SlurmdSyslogDebug = quiet
SlurmdTimeout = 600 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /opt/slurm/etc/slurm.conf
SLURM_VERSION = 17.11.10
SrunEpilog = (null)
SrunPortRange = 0-0
SrunProlog = (null)
StateSaveLocation = /slurm/state
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/affinity
TaskPluginParam = (null type)
TaskProlog = (null)
TCPTimeout = 2 sec
TmpFS = /var/log/slurm/tmp
TopologyParam = (null)
TopologyPlugin = topology/tree
TrackWCKey = No
TreeWidth = 14
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 3600 sec
Account Gather
InterconnectOFEDPort = 1
ProfileHDF5Default = None
ProfileHDF5Dir = /data/slurm/profile_data
Slurmctld(primary/backup) at slurm1,slurm2/(NULL) are UP/DOWN
====
Here's the original problem statement:
Hi All,
Just checking to see if this sounds familiar to anyone.
Environment:
- CentOS 7.5 x86_64
- Slurm 17.11.10 (but this also happened with 17.11.5)
We typically run about 100 tests/night, selected from a handful of
favorites. For roughly 1 in 300 test runs, we see one of two mysterious
failures:
1. The 5 minute cancellation
A job will be rolling along, generating its expected output, and then
this message appears:
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***
srun: error: nodename: task 250: Terminated
srun: Terminating job step 3531.0
sacct reports:

JobID        Start               End                 ExitCode State
------------ ------------------- ------------------- -------- ----------
3418         2019-01-29T05:54:07 2019-01-29T05:59:16 0:9      FAILED
When these failures occur, they consistently strike just about 5 minutes
into the run.
2. The random cancellation
As above, a job will be generating the expected output, and then we see
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***
srun: error: nodename: task 250: Terminated
srun: Terminating job step 3531.0
But this time, sacct reports:

JobID        Start               End                 ExitCode State
------------ ------------------- ------------------- -------- ----------
3531         2019-01-30T07:21:25 2019-01-30T07:35:50 0:0      COMPLETED
3531.0       2019-01-30T07:21:27 2019-01-30T07:35:56 0:15     CANCELLED
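(For reference, those tables come from something along the lines of the
command below -- the exact format string is my reconstruction, not
necessarily what our harness uses. If I'm reading the sacct man page
correctly, the second field of ExitCode is the terminating signal, so
the 0:9 and 0:15 above would be SIGKILL and SIGTERM respectively.)

# Approximate query behind the tables above
sacct -j 3531 --format=JobID,Start,End,ExitCode,State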
I think we've seen these cancellations pop up anywhere from a minute or
two into the test run up to perhaps 20 minutes in.
The only thing slightly unusual in our job submissions is that we use
srun's "--immediate=120" so that the scripts can respond appropriately
if a node goes down.
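To give a feel for it, the launch looks something like the sketch below
(the task count and binary name are placeholders, not our actual test
harness):

#!/bin/bash
# Simplified launch wrapper -- illustrates the --immediate usage only
srun --immediate=120 --ntasks=512 ./test_binary
rc=$?
if [ $rc -ne 0 ]; then
    # With --immediate=120, srun gives up if the resources aren't
    # available within 120 seconds, so the script can react to a down
    # node instead of hanging.
    echo "srun exited with $rc; check node availability" >&2
fi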
With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a clue in
the slurmctld or slurmd logs.
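In case it's useful, this is roughly how I've been scanning the logs
around a cancellation (the timestamp is just an example; the paths match
the config above):

# On the controller (SlurmctldLogFile)
grep '2019-01-30T07:3' /var/log/slurm/slurmctld.log
# On the compute node (SlurmdLogFile = /var/log/slurm/slurm%h.log)
grep '2019-01-30T07:3' /var/log/slurm/slurm$(hostname).log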
Any thoughts on what might be happening, or what I might try next?
------------------------------------------------------------------------
*From:* Chris Samuel <chris at csamuel.org>
*Sent:* Saturday, February 02, 2019 1:38AM
*To:* Slurm-users <slurm-users at lists.schedmd.com>
*Cc:*
*Subject:* Re: [slurm-users] Mysterious job terminations on Slurm 17.11.10
On Friday, 1 February 2019 6:04:45 AM AEDT Andy Riebs wrote:
> Any thoughts on what might be happening, or what I might try next?
Anything in dmesg on the nodes or syslog at that time?
I'm wondering if you're seeing the OOM killer step in and take processes out.
What does your slurm.conf look like?
All the best,
Chris