<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    Excellent suggestions, Chris, but they didn't pan out (and I had such
    high hopes for them!).<br>
    <br>
    As for slurm.conf, here's the output from "scontrol show config"
    (I've included the original problem report below):<br>
    <br>
    <tt>Configuration data as of 2019-02-05T14:29:11</tt><tt><br>
    </tt><tt>AccountingStorageBackupHost = (null)</tt><tt><br>
    </tt><tt>AccountingStorageEnforce = none</tt><tt><br>
    </tt><tt>AccountingStorageHost   = slurmdb</tt><tt><br>
    </tt><tt>AccountingStorageLoc    = N/A</tt><tt><br>
    </tt><tt>AccountingStoragePort   = 6819</tt><tt><br>
    </tt><tt>AccountingStorageTRES   = cpu,mem,energy,node,billing</tt><tt><br>
    </tt><tt>AccountingStorageType   = accounting_storage/slurmdbd</tt><tt><br>
    </tt><tt>AccountingStorageUser   = N/A</tt><tt><br>
    </tt><tt>AccountingStoreJobComment = Yes</tt><tt><br>
    </tt><tt>AcctGatherEnergyType    = acct_gather_energy/rapl</tt><tt><br>
    </tt><tt>AcctGatherFilesystemType = acct_gather_filesystem/lustre</tt><tt><br>
    </tt><tt>AcctGatherInterconnectType = acct_gather_interconnect/ofed</tt><tt><br>
    </tt><tt>AcctGatherNodeFreq      = 30 sec</tt><tt><br>
    </tt><tt>AcctGatherProfileType   = acct_gather_profile/hdf5</tt><tt><br>
    </tt><tt>AllowSpecResourcesUsage = 0</tt><tt><br>
    </tt><tt>AuthInfo                = (null)</tt><tt><br>
    </tt><tt>AuthType                = auth/munge</tt><tt><br>
    </tt><tt>BackupAddr              = (null)</tt><tt><br>
    </tt><tt>BackupController        = (null)</tt><tt><br>
    </tt><tt>BatchStartTimeout       = 10 sec</tt><tt><br>
    </tt><tt>BOOT_TIME               = 2019-02-04T21:05:46</tt><tt><br>
    </tt><tt>BurstBufferType         = (null)</tt><tt><br>
    </tt><tt>CheckpointType          = checkpoint/none</tt><tt><br>
    </tt><tt>ChosLoc                 = (null)</tt><tt><br>
    </tt><tt>ClusterName             = cluster</tt><tt><br>
    </tt><tt>CompleteWait            = 32 sec</tt><tt><br>
    </tt><tt>ControlAddr             = slurm</tt><tt><br>
    </tt><tt>ControlMachine          = slurm1,slurm2</tt><tt><br>
    </tt><tt>CoreSpecPlugin          = core_spec/none</tt><tt><br>
    </tt><tt>CpuFreqDef              = Unknown</tt><tt><br>
    </tt><tt>CpuFreqGovernors        = Performance,OnDemand</tt><tt><br>
    </tt><tt>CryptoType              = crypto/munge</tt><tt><br>
    </tt><tt>DebugFlags              = Energy,NO_CONF_HASH,Profile</tt><tt><br>
    </tt><tt>DefMemPerNode           = UNLIMITED</tt><tt><br>
    </tt><tt>DisableRootJobs         = No</tt><tt><br>
    </tt><tt>EioTimeout              = 60</tt><tt><br>
    </tt><tt>EnforcePartLimits       = ANY</tt><tt><br>
    </tt><tt>Epilog                  = /opt/slurm/scripts/epilog.sh</tt><tt><br>
    </tt><tt>EpilogMsgTime           = 4000 usec</tt><tt><br>
    </tt><tt>EpilogSlurmctld         = (null)</tt><tt><br>
    </tt><tt>ExtSensorsType          = ext_sensors/none</tt><tt><br>
    </tt><tt>ExtSensorsFreq          = 0 sec</tt><tt><br>
    </tt><tt>FastSchedule            = 1</tt><tt><br>
    </tt><tt>FederationParameters    = (null)</tt><tt><br>
    </tt><tt>FirstJobId              = 1</tt><tt><br>
    </tt><tt>GetEnvTimeout           = 2 sec</tt><tt><br>
    </tt><tt>GresTypes               = (null)</tt><tt><br>
    </tt><tt>GroupUpdateForce        = 1</tt><tt><br>
    </tt><tt>GroupUpdateTime         = 600 sec</tt><tt><br>
    </tt><tt>HASH_VAL                = Match</tt><tt><br>
    </tt><tt>HealthCheckInterval     = 7200 sec</tt><tt><br>
    </tt><tt>HealthCheckNodeState    = IDLE</tt><tt><br>
    </tt><tt>HealthCheckProgram      =
      /opt/nhc/1.4.2/slurm_scripts/nhc_idle_check.sh</tt><tt><br>
    </tt><tt>InactiveLimit           = 120 sec</tt><tt><br>
    </tt><tt>JobAcctGatherFrequency  =
      Task=30,Energy=30,Network=30,Filesystem=30</tt><tt><br>
    </tt><tt>JobAcctGatherType       = jobacct_gather/linux</tt><tt><br>
    </tt><tt>JobAcctGatherParams     = (null)</tt><tt><br>
    </tt><tt>JobCheckpointDir        = /var/slurm/checkpoint</tt><tt><br>
    </tt><tt>JobCompHost             = localhost</tt><tt><br>
    </tt><tt>JobCompLoc              = /var/log/slurm/job_completions</tt><tt><br>
    </tt><tt>JobCompPort             = 0</tt><tt><br>
    </tt><tt>JobCompType             = jobcomp/none</tt><tt><br>
    </tt><tt>JobCompUser             = root</tt><tt><br>
    </tt><tt>JobContainerType        = job_container/none</tt><tt><br>
    </tt><tt>JobCredentialPrivateKey = (null)</tt><tt><br>
    </tt><tt>JobCredentialPublicCertificate = (null)</tt><tt><br>
    </tt><tt>JobFileAppend           = 0</tt><tt><br>
    </tt><tt>JobRequeue              = 1</tt><tt><br>
    </tt><tt>JobSubmitPlugins        = (null)</tt><tt><br>
    </tt><tt>KeepAliveTime           = SYSTEM_DEFAULT</tt><tt><br>
    </tt><tt>KillOnBadExit           = 1</tt><tt><br>
    </tt><tt>KillWait                = 30 sec</tt><tt><br>
    </tt><tt>LaunchParameters        = send_gids</tt><tt><br>
    </tt><tt>LaunchType              = launch/slurm</tt><tt><br>
    </tt><tt>Layouts                 = </tt><tt><br>
    </tt><tt>Licenses                = (null)</tt><tt><br>
    </tt><tt>LicensesUsed            = (null)</tt><tt><br>
    </tt><tt>LogTimeFormat           = iso8601_ms</tt><tt><br>
    </tt><tt>MailDomain              = (null)</tt><tt><br>
    </tt><tt>MailProg                = /bin/mail</tt><tt><br>
    </tt><tt>MaxArraySize            = 1001</tt><tt><br>
    </tt><tt>MaxJobCount             = 10000</tt><tt><br>
    </tt><tt>MaxJobId                = 67043328</tt><tt><br>
    </tt><tt>MaxMemPerNode           = UNLIMITED</tt><tt><br>
    </tt><tt>MaxStepCount            = 40000</tt><tt><br>
    </tt><tt>MaxTasksPerNode         = 512</tt><tt><br>
    </tt><tt>MCSPlugin               = mcs/none</tt><tt><br>
    </tt><tt>MCSParameters           = (null)</tt><tt><br>
    </tt><tt>MemLimitEnforce         = Yes</tt><tt><br>
    </tt><tt>MessageTimeout          = 60 sec</tt><tt><br>
    </tt><tt>MinJobAge               = 2 sec</tt><tt><br>
    </tt><tt>MpiDefault              = pmix</tt><tt><br>
    </tt><tt>MpiParams               = (null)</tt><tt><br>
    </tt><tt>MsgAggregationParams    = (null)</tt><tt><br>
    </tt><tt>NEXT_JOB_ID             = 11559</tt><tt><br>
    </tt><tt>NodeFeaturesPlugins     = (null)</tt><tt><br>
    </tt><tt>OverTimeLimit           = 0 min</tt><tt><br>
    </tt><tt>PluginDir               = /opt/slurm/lib/slurm</tt><tt><br>
    </tt><tt>PlugStackConfig         = /opt/slurm/etc/plugstack.conf</tt><tt><br>
    </tt><tt>PowerParameters         = (null)</tt><tt><br>
    </tt><tt>PowerPlugin             = </tt><tt><br>
    </tt><tt>PreemptMode             = OFF</tt><tt><br>
    </tt><tt>PreemptType             = preempt/none</tt><tt><br>
    </tt><tt>PriorityParameters      = (null)</tt><tt><br>
    </tt><tt>PriorityType            = priority/basic</tt><tt><br>
    </tt><tt>PrivateData             = none</tt><tt><br>
    </tt><tt>ProctrackType           = proctrack/linuxproc</tt><tt><br>
    </tt><tt>Prolog                  = /opt/slurm/scripts/prolog.sh</tt><tt><br>
    </tt><tt>PrologEpilogTimeout     = 65534</tt><tt><br>
    </tt><tt>PrologSlurmctld         = (null)</tt><tt><br>
    </tt><tt>PrologFlags             = (null)</tt><tt><br>
    </tt><tt>PropagatePrioProcess    = 0</tt><tt><br>
    </tt><tt>PropagateResourceLimits = ALL</tt><tt><br>
    </tt><tt>PropagateResourceLimitsExcept = (null)</tt><tt><br>
    </tt><tt>RebootProgram           = (null)</tt><tt><br>
    </tt><tt>ReconfigFlags           = (null)</tt><tt><br>
    </tt><tt>RequeueExit             = (null)</tt><tt><br>
    </tt><tt>RequeueExitHold         = (null)</tt><tt><br>
    </tt><tt>ResumeProgram           = (null)</tt><tt><br>
    </tt><tt>ResumeRate              = 300 nodes/min</tt><tt><br>
    </tt><tt>ResumeTimeout           = 60 sec</tt><tt><br>
    </tt><tt>ResvEpilog              = (null)</tt><tt><br>
    </tt><tt>ResvOverRun             = 0 min</tt><tt><br>
    </tt><tt>ResvProlog              = (null)</tt><tt><br>
    </tt><tt>ReturnToService         = 0</tt><tt><br>
    </tt><tt>RoutePlugin             = route/default</tt><tt><br>
    </tt><tt>SallocDefaultCommand    = (null)</tt><tt><br>
    </tt><tt>SbcastParameters        = (null)</tt><tt><br>
    </tt><tt>SchedulerParameters     =
      default_queue_depth=1000,sched_interval=6,ff_wait=2,ff_wait_value=5</tt><tt><br>
    </tt><tt>SchedulerTimeSlice      = 30 sec</tt><tt><br>
    </tt><tt>SchedulerType           = sched/backfill</tt><tt><br>
    </tt><tt>SelectType              = select/cons_res</tt><tt><br>
    </tt><tt>SelectTypeParameters    = CR_CPU</tt><tt><br>
    </tt><tt>SlurmUser               = slurm(1512)</tt><tt><br>
    </tt><tt>SlurmctldDebug          = debug2</tt><tt><br>
    </tt><tt>SlurmctldLogFile        = /var/log/slurm/slurmctld.log</tt><tt><br>
    </tt><tt>SlurmctldPort           = 6809-6816</tt><tt><br>
    </tt><tt>SlurmctldSyslogDebug    = quiet</tt><tt><br>
    </tt><tt>SlurmctldTimeout        = 300 sec</tt><tt><br>
    </tt><tt>SlurmdDebug             = debug5</tt><tt><br>
    </tt><tt>SlurmdLogFile           = /var/log/slurm/slurm%h.log</tt><tt><br>
    </tt><tt>SlurmdPidFile           = /var/run/slurmd.pid</tt><tt><br>
    </tt><tt>SlurmdPort              = 6818</tt><tt><br>
    </tt><tt>SlurmdSpoolDir          = /var/log/slurm/slurmd.spool</tt><tt><br>
    </tt><tt>SlurmdSyslogDebug       = quiet</tt><tt><br>
    </tt><tt>SlurmdTimeout           = 600 sec</tt><tt><br>
    </tt><tt>SlurmdUser              = root(0)</tt><tt><br>
    </tt><tt>SlurmSchedLogFile       = (null)</tt><tt><br>
    </tt><tt>SlurmSchedLogLevel      = 0</tt><tt><br>
    </tt><tt>SlurmctldPidFile        = /var/run/slurmctld.pid</tt><tt><br>
    </tt><tt>SlurmctldPlugstack      = (null)</tt><tt><br>
    </tt><tt>SLURM_CONF              = /opt/slurm/etc/slurm.conf</tt><tt><br>
    </tt><tt>SLURM_VERSION           = 17.11.10</tt><tt><br>
    </tt><tt>SrunEpilog              = (null)</tt><tt><br>
    </tt><tt>SrunPortRange           = 0-0</tt><tt><br>
    </tt><tt>SrunProlog              = (null)</tt><tt><br>
    </tt><tt>StateSaveLocation       = /slurm/state</tt><tt><br>
    </tt><tt>SuspendExcNodes         = (null)</tt><tt><br>
    </tt><tt>SuspendExcParts         = (null)</tt><tt><br>
    </tt><tt>SuspendProgram          = (null)</tt><tt><br>
    </tt><tt>SuspendRate             = 60 nodes/min</tt><tt><br>
    </tt><tt>SuspendTime             = NONE</tt><tt><br>
    </tt><tt>SuspendTimeout          = 30 sec</tt><tt><br>
    </tt><tt>SwitchType              = switch/none</tt><tt><br>
    </tt><tt>TaskEpilog              = (null)</tt><tt><br>
    </tt><tt>TaskPlugin              = task/affinity</tt><tt><br>
    </tt><tt>TaskPluginParam         = (null type)</tt><tt><br>
    </tt><tt>TaskProlog              = (null)</tt><tt><br>
    </tt><tt>TCPTimeout              = 2 sec</tt><tt><br>
    </tt><tt>TmpFS                   = /var/log/slurm/tmp</tt><tt><br>
    </tt><tt>TopologyParam           = (null)</tt><tt><br>
    </tt><tt>TopologyPlugin          = topology/tree</tt><tt><br>
    </tt><tt>TrackWCKey              = No</tt><tt><br>
    </tt><tt>TreeWidth               = 14</tt><tt><br>
    </tt><tt>UsePam                  = 0</tt><tt><br>
    </tt><tt>UnkillableStepProgram   = (null)</tt><tt><br>
    </tt><tt>UnkillableStepTimeout   = 60 sec</tt><tt><br>
    </tt><tt>VSizeFactor             = 0 percent</tt><tt><br>
    </tt><tt>WaitTime                = 3600 sec</tt><tt><br>
    </tt><tt><br>
    </tt><tt>Account Gather</tt><tt><br>
    </tt><tt>InterconnectOFEDPort    = 1</tt><tt><br>
    </tt><tt>ProfileHDF5Default      = None</tt><tt><br>
    </tt><tt>ProfileHDF5Dir          = /data/slurm/profile_data</tt><tt><br>
    </tt><tt><br>
    </tt><tt>Slurmctld(primary/backup) at slurm1,slurm2/(NULL) are
      UP/DOWN<br>
      <br>
    </tt>====<br>
    Here's the original problem statement:<br>
    <br>
    Hi All,<br>
    <br>
    Just checking to see if this sounds familiar to anyone.<br>
    <br>
    Environment:<br>
    - CentOS 7.5 x86_64<br>
    - Slurm 17.11.10 (but this also happened with 17.11.5)<br>
    <br>
    We typically run about 100 tests/night, selected from a handful of
    favorites. For roughly 1 in 300 test runs, we see one of two
    mysterious failures:<br>
    <br>
    1. The 5-minute cancellation<br>
    <br>
    A job will be rolling along, generating its expected output, and
    then this message appears:<br>
    <blockquote>srun: forcing job termination<br>
      srun: Job step aborted: Waiting up to 32 seconds for job step to
      finish.<br>
      slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT
      2019-01-30T07:35:50 ***<br>
      srun: error: nodename: task 250: Terminated<br>
      srun: Terminating job step 3531.0<br>
    </blockquote>
    sacct reports<br>
    <blockquote>
      <pre>       JobID               Start                 End ExitCode      State
------------ ------------------- ------------------- -------- ----------
3418         2019-01-29T05:54:07 2019-01-29T05:59:16      0:9     FAILED</pre>
    </blockquote>
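    (For reference, the sacct output above and below came from a query
    along these lines; the exact field list is reconstructed from memory
    rather than pasted:)<br>
    <blockquote>
      <pre># illustrative query; the field list is an assumption
sacct -j 3418 --format=JobID,Start,End,ExitCode,State</pre>
    </blockquote>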
    When these failures do occur, it's consistently at just about 5
    minutes into the run.<br>
    <br>
    2. The random cancellation<br>
    <br>
    As above, a job will be generating the expected output, and then we
    see:<br>
    <blockquote>srun: forcing job termination<br>
      srun: Job step aborted: Waiting up to 32 seconds for job step to
      finish.<br>
      slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT
      2019-01-30T07:35:50 ***<br>
      srun: error: nodename: task 250: Terminated<br>
      srun: Terminating job step 3531.0<br>
    </blockquote>
    But this time, sacct reports<br>
    <blockquote>
      <pre>       JobID               Start                 End ExitCode      State
------------ ------------------- ------------------- -------- ----------
3531         2019-01-30T07:21:25 2019-01-30T07:35:50      0:0  COMPLETED
3531.0       2019-01-30T07:21:27 2019-01-30T07:35:56     0:15  CANCELLED</pre>
    </blockquote>
    I think we've seen these cancellations pop up anywhere from a minute
    or two into the test run to perhaps 20 minutes in.<br>
    <br>
    The only thing slightly unusual in our job submissions is that we
    use srun's "--immediate=120" so that the scripts can respond
    appropriately if a node goes down.<br>
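    <br>
    (A rough sketch of the submission pattern, in case it matters; apart
    from --immediate=120, the node count, task count, and script name
    are placeholders:)<br>
    <blockquote>
      <pre># sketch only -- sizes and script name are placeholders
srun --immediate=120 -N4 --ntasks-per-node=128 ./run_test.sh
rc=$?
if [ $rc -ne 0 ]; then
    echo "srun exited with $rc (allocation timed out or a node went down)"
fi</pre>
    </blockquote>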
    <br>
    With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a
    clue in the slurmctld or slurmd logs.<br>
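    <br>
    (In case anyone wants to retrace the search: the debug levels above
    are set in slurm.conf, and I grepped around the cancellation
    timestamps roughly as follows -- "nodename" stands in for the real
    host, and SlurmdLogFile expands %h to the host name per the config
    above:)<br>
    <blockquote>
      <pre># slurmctld's level can also be raised at runtime (slurmd's is set via SlurmdDebug in slurm.conf)
scontrol setdebug debug2
# grep around the cancellation time ("nodename" is a placeholder host)
grep -i cancel /var/log/slurm/slurmctld.log
grep -i cancel /var/log/slurm/slurmnodename.log</pre>
    </blockquote>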
    <br>
    Any thoughts on what might be happening, or what I might try next?<br>
    <br>
    <div id="smartTemplate4-quoteHeader">
      <hr> <b>From:</b> Chris Samuel <a class="moz-txt-link-rfc2396E" href="mailto:chris@csamuel.org"><chris@csamuel.org></a> <br>
      <b>Sent:</b> Saturday, February 02, 2019 1:38AM <br>
      <b>To:</b> Slurm-users <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com"><slurm-users@lists.schedmd.com></a><br>
      <b>Cc:</b> <br>
      <b>Subject:</b> Re: [slurm-users] Mysterious job terminations on
      Slurm 17.11.10 <br>
    </div>
    <div class="replaced-blockquote" cite="mid:8157362.Z0MJsqAMyf@quad"
      type="cite">
      <pre class="moz-quote-pre" wrap="">On Friday, 1 February 2019 6:04:45 AM AEDT Andy Riebs wrote:

</pre>
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">Any thoughts on what might be happening, or what I might try next?
</pre>
      </blockquote>
      <pre class="moz-quote-pre" wrap="">Anything in dmesg on the nodes or syslog at that time?

I'm wondering if you're seeing the OOM killer step in and take processes out.

What does your slurm.conf look like?

All the best,
Chris
</pre>
    </div>
    <br>
  </body>
</html>