[slurm-users] Mysterious job terminations on Slurm 17.11.10

Tue Feb 5 14:47:03 UTC 2019

Excellent suggestions Chris, but they didn't pan out (and I had such 
high hopes for them!).

As for slurm.conf, here's the output from "scontrol show config" (I've 
included the original problem report below):

Configuration data as of 2019-02-05T14:29:11
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = slurmdb
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/rapl
AcctGatherFilesystemType = acct_gather_filesystem/lustre
AcctGatherInterconnectType = acct_gather_interconnect/ofed
AcctGatherNodeFreq      = 30 sec
AcctGatherProfileType   = acct_gather_profile/hdf5
AllowSpecResourcesUsage = 0
AuthInfo                = (null)
AuthType                = auth/munge
BackupAddr              = (null)
BackupController        = (null)
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2019-02-04T21:05:46
BurstBufferType         = (null)
CheckpointType          = checkpoint/none
ChosLoc                 = (null)
ClusterName             = cluster
CompleteWait            = 32 sec
ControlAddr             = slurm
ControlMachine          = slurm1,slurm2
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand
CryptoType              = crypto/munge
DebugFlags              = Energy,NO_CONF_HASH,Profile
DefMemPerNode           = UNLIMITED
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = ANY
Epilog                  = /opt/slurm/scripts/epilog.sh
EpilogMsgTime           = 4000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FastSchedule            = 1
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = (null)
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 7200 sec
HealthCheckNodeState    = IDLE
HealthCheckProgram      = /opt/nhc/1.4.2/slurm_scripts/nhc_idle_check.sh
InactiveLimit           = 120 sec
JobAcctGatherFrequency  = Task=30,Energy=30,Network=30,Filesystem=30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm/job_completions
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 1
KillWait                = 30 sec
LaunchParameters        = send_gids
LaunchType              = launch/slurm
Layouts                 =
Licenses                = (null)
LicensesUsed            = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 1001
MaxJobCount             = 10000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MemLimitEnforce         = Yes
MessageTimeout          = 60 sec
MinJobAge               = 2 sec
MpiDefault              = pmix
MpiParams               = (null)
MsgAggregationParams    = (null)
NEXT_JOB_ID             = 11559
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /opt/slurm/lib/slurm
PlugStackConfig         = /opt/slurm/etc/plugstack.conf
PowerParameters         = (null)
PowerPlugin             =
PreemptMode             = OFF
PreemptType             = preempt/none
PriorityParameters      = (null)
PriorityType            = priority/basic
PrivateData             = none
ProctrackType           = proctrack/linuxproc
Prolog                  = /opt/slurm/scripts/prolog.sh
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 0
RoutePlugin             = route/default
SallocDefaultCommand    = (null)
SbcastParameters        = (null)
SchedulerParameters     = default_queue_depth=1000,sched_interval=6,ff_wait=2,ff_wait_value=5
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CPU
SlurmUser               = slurm(1512)
SlurmctldDebug          = debug2
SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmctldPort           = 6809-6816
SlurmctldSyslogDebug    = quiet
SlurmctldTimeout        = 300 sec
SlurmdDebug             = debug5
SlurmdLogFile           = /var/log/slurm/slurm%h.log
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/log/slurm/slurmd.spool
SlurmdSyslogDebug       = quiet
SlurmdTimeout           = 600 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /opt/slurm/etc/slurm.conf
SLURM_VERSION           = 17.11.10
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /slurm/state
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/affinity
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /var/log/slurm/tmp
TopologyParam           = (null)
TopologyPlugin          = topology/tree
TrackWCKey              = No
TreeWidth               = 14
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 3600 sec

Account Gather
InterconnectOFEDPort    = 1
ProfileHDF5Default      = None
ProfileHDF5Dir          = /data/slurm/profile_data

Slurmctld(primary/backup) at slurm1,slurm2/(NULL) are UP/DOWN
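
Since one failure mode shows up at about 5 minutes, it may help to list every setting in the dump that is expressed in seconds and compare those against the failure time. The following is a minimal sketch, not a diagnosis: the function names are hypothetical, the sample is a handful of lines copied from the output above, and any value that a mail client has wrapped onto a second line would need to be rejoined before parsing.

```python
def parse_scontrol_config(text):
    """Parse 'Key = Value' lines from `scontrol show config` output into a dict."""
    config = {}
    for line in text.splitlines():
        # Split on the first '=' only, so values such as 'Task=30,Energy=30' survive.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # section headers and blank lines carry no '='
        config[key.strip()] = value.strip()
    return config


def settings_in_seconds(config):
    """Return {name: seconds} for values of the exact form '<integer> sec'."""
    result = {}
    for key, value in config.items():
        parts = value.split()
        if len(parts) == 2 and parts[1] == "sec" and parts[0].isdigit():
            result[key] = int(parts[0])
    return result


# A few lines copied from the dump above (hypothetical variable name).
sample = """\
Configuration data as of 2019-02-05T14:29:11
BatchStartTimeout       = 10 sec
CompleteWait            = 32 sec
InactiveLimit           = 120 sec
KillWait                = 30 sec
MessageTimeout          = 60 sec
SlurmctldTimeout        = 300 sec
SlurmdTimeout           = 600 sec
SLURM_VERSION           = 17.11.10
"""

timeouts = settings_in_seconds(parse_scontrol_config(sample))
for name, seconds in sorted(timeouts.items(), key=lambda kv: kv[1]):
    print(f"{name:<20} {seconds:>5} s")
```

A setting whose value is close to the failure time is only a candidate to look into; the match by itself proves nothing.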

====
Here's the original problem statement:

Hi All,

Just checking to see if this sounds familiar to anyone.

Environment:
- CentOS 7.5 x86_64
- Slurm 17.11.10 (but this also happened with 17.11.5)

We typically run about 100 tests/night, selected from a handful of 
favorites. For roughly 1 in 300 test runs, we see one of two mysterious 
failures:

1. The 5 minute cancellation

A job will be rolling along, generating its expected output, and then 
this message appears:

    srun: forcing job termination
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***
    srun: error: nodename: task 250: Terminated
    srun: Terminating job step 3531.0

sacct reports

    JobID                      Start                 End ExitCode      State
    ------------ ------------------- ------------------- -------- ----------
    3418         2019-01-29T05:54:07 2019-01-29T05:59:16      0:9     FAILED

When these failures occur, they consistently hit at just about 5 minutes 
into the run.
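
To put a number on "just about 5 minutes", the elapsed time can be computed from the Start and End columns that sacct reports. This is a small sketch using only the Python standard library; the helper name is made up for illustration, and the timestamps are the ones from job 3418 above.

```python
from datetime import datetime

# sacct prints timestamps in this ISO-like form, e.g. 2019-01-29T05:54:07
SACCT_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def elapsed_seconds(start, end):
    """Return whole seconds between two sacct Start/End timestamps."""
    t0 = datetime.strptime(start, SACCT_TIME_FORMAT)
    t1 = datetime.strptime(end, SACCT_TIME_FORMAT)
    return int((t1 - t0).total_seconds())


# Job 3418 from the sacct output above.
print(elapsed_seconds("2019-01-29T05:54:07", "2019-01-29T05:59:16"))  # prints 309
```

That is 5 minutes and 9 seconds for this particular job.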

2. The random cancellation

As above, a job will be generating the expected output, and then we see

    srun: forcing job termination
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***
    srun: error: nodename: task 250: Terminated
    srun: Terminating job step 3531.0

But this time, sacct reports

    JobID                      Start                 End ExitCode      State
    ------------ ------------------- ------------------- -------- ----------
    3531         2019-01-30T07:21:25 2019-01-30T07:35:50      0:0  COMPLETED
    3531.0       2019-01-30T07:21:27 2019-01-30T07:35:56     0:15  CANCELLED

I think we've seen these cancellations pop up as early as a minute or two 
into the test run, and as late as perhaps 20 minutes in.

The only thing slightly unusual in our job submissions is that we use 
srun's "--immediate=120" so that the scripts can respond appropriately 
if a node goes down.

With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a clue in 
the slurmctld or slurmd logs.

Any thoughts on what might be happening, or what I might try next?

------------------------------------------------------------------------
*From:* Chris Samuel <chris at csamuel.org>
*Sent:* Saturday, February 02, 2019 1:38AM
*To:* Slurm-users <slurm-users at lists.schedmd.com>
*Cc:*
*Subject:* Re: [slurm-users] Mysterious job terminations on Slurm 17.11.10

On Friday, 1 February 2019 6:04:45 AM AEDT Andy Riebs wrote:

> Any thoughts on what might be happening, or what I might try next?

Anything in dmesg on the nodes or syslog at that time?

I'm wondering if you're seeing the OOM killer step in and take processes out.
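
One way to check for this is to scan saved dmesg or syslog output for OOM-killer messages. The sketch below is only illustrative: the search patterns are an assumption about typical Linux kernel wording (which varies between kernel versions), the function name is hypothetical, and the sample lines are invented for the example rather than taken from any real node.

```python
import re

# Assumed kernel message fragments; exact wording differs between kernel versions.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory: Kill|Killed process \d+"
)


def find_oom_events(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


# Invented sample lines standing in for real dmesg output.
sample = [
    "[12345.678901] app invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0",
    "[12345.679100] Out of memory: Kill process 4321 (app) score 912 or sacrifice child",
    "[12345.679200] Killed process 4321 (app) total-vm:1048576kB, anon-rss:524288kB, file-rss:0kB",
    "[12350.000000] eth0: link up",
]

for hit in find_oom_events(sample):
    print(hit)
```

In practice the lines would come from a file holding the output of dmesg, or from the syslog file on the node, and the timestamps of any hits would be compared with the time of the cancellation.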

What does your slurm.conf look like?

All the best,
Chris
