<div dir="ltr"><div dir="auto">Hi,</div><div dir="auto"><br></div><div>I don't think I have ever seen a sig 9 that wasn't a user.  Is it possible you have folks in slurm coordinator/administrator that may be killing jobs or run running a cleanup script?  Only other thing I can think of is the user is closing their remote session before the srun completes. I can't recall right now but oom might be working.  dmesg -T | grep oom to see if the OS is wiping out jobs to recover memory.  <br></div><div><br></div><div>Doug<br></div><div dir="auto"><div dir="auto"><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Apr 3, 2023, 8:56 AM Robert Barton <<a href="mailto:rob@realintent.com" target="_blank">rob@realintent.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div>
    Hello,<br>
    <br>
    I'm looking for help understanding a problem where Slurm indicates
    that a job was killed, but not why. It's not clear what is actually
    killing the jobs. We have seen jobs killed for time limits and
    out-of-memory conditions, but those reasons are obvious in the logs
    when they happen, and neither appears here.<br>
    <br>
    Searching for the error messages suggests that the jobs are being
    killed from outside of Slurm, but the engineer insists that this is
    not the case.<br>
    <br>
    This happens sporadically, maybe every one or two million jobs, and
    is not reliably reproducible. I'm looking for any ways to gather
    more information about the cause of these issues.<br>
    <br>
    Slurm version: 20.11.9<br>
    <br>
    The relevant messages:<br>
    <br>
    slurmctld.log:<br>
    <br>
    <font face="monospace">[2023-03-27T20:53:55.336] sched:
      _slurm_rpc_allocate_resources JobId=31360187 NodeList=(null)
      usec=5871<br>
      [2023-03-27T20:54:16.753] sched: Allocate JobId=31360187
      NodeList=cl4 #CPUs=1 Partition=build<br>
      [2023-03-27T20:54:27.104] _job_complete: JobId=31360187 WTERMSIG 9<br>
      [2023-03-27T20:54:27.104] _job_complete: JobId=31360187 done</font><br>
    <br>
    slurmd.log:<br>
    <br>
    <font face="monospace">[2023-03-27T20:54:23.978] launch task
      StepId=31360187.0 request from UID:255 GID:100 HOST:10.52.49.107
      PORT:59370<br>
      [2023-03-27T20:54:23.979] task/affinity: lllp_distribution:
      JobId=31360187 implicit auto binding: cores,one_thread, dist 1<br>
      [2023-03-27T20:54:23.979] task/affinity: _lllp_generate_cpu_bind:
      _lllp_generate_cpu_bind jobid [31360187]: mask_cpu,one_thread,
      0x000008<br>
      [2023-03-27T20:54:24.236] [31360187.0] task/cgroup:
      _memcg_initialize: /slurm/uid_255/job_31360187: alloc=4096MB
      mem.limit=4096MB memsw.limit=4096MB<br>
      [2023-03-27T20:54:24.236] [31360187.0] task/cgroup:
      _memcg_initialize: /slurm/uid_255/job_31360187/step_0:
      alloc=4096MB mem.limit=4096MB memsw.limit=4096MB<br>
      [2023-03-27T20:54:27.038] [31360187.0] error: *** STEP 31360187.0
      ON cl4 CANCELLED AT 2023-03-27T20:54:27 ***<br>
      [2023-03-27T20:54:27.099] [31360187.0] done with job</font><br>
    <br>
    srun output:<br>
    <br>
    <font face="monospace">srun: job 31360187 queued and waiting for
      resources<br>
      srun: job 31360187 has been allocated resources<br>
      srun: jobid 31360187: nodes(1):`cl4', cpu counts: 1(x1)<br>
      srun: launching StepId=31360187.0 on host cl4, 1 tasks: 0<br>
      srun: launch/slurm: launch_p_step_launch: StepId=31360187.0
      aborted before step completely launched.<br>
      srun: Complete StepId=31360187.0+0 received<br>
      slurmstepd: error: *** STEP 31360187.0 ON cl4 CANCELLED AT
      2023-03-27T20:54:27 ***<br>
      srun: launch/slurm: _task_finish: Received task exit notification
      for 1 task of StepId=31360187.0 (status=0x0009).</font><br>
    <br>
    accounting:<br>
    <br>
    <font face="monospace"># sacct -o jobid,elapsed,reason,state,exit -j
      31360187<br>
             JobID    Elapsed                 Reason      State ExitCode
      <br>
      ------------ ---------- ---------------------- ---------- --------
      <br>
      31360187       00:00:11                   None     FAILED      0:9
      <br>
    </font><br>
    <br>
    These are compile jobs run via srun. The srun command is of this
    form (I've omitted the -I and -D parts because they are irrelevant
    and contain private information):<br>
    <br>
    <font face="Courier New">( echo -n 'max=3126 ; printf "[%2d%%
      %${#max}d/3126] %s\n" `expr 2090 \* 100 / 3126` 2090 "["c+11.2"]
      $(printf "[slurm %4s %s]" $(uname -n) $SLURM_JOB_ID) objectfile.o"
      ; fs_sync.sh sourcefile.cpp Makefile.flags ; ' ; printf '%q ' g++
      -MT objectfile.o -MMD -MP -MF optionfile.Td -m64 -Werror -W -Wall
      -Wno-parentheses -Wno-unused-parameter -Wno-uninitialized
      -Wno-maybe-uninitialized  -Wno-misleading-indentation
      -Wno-implicit-fallthrough -std=c++20 -g -g2 ) | srun  -J rgrmake
      -p build -N 1 -n 1 -c 1 --quit-on-interrupt --mem=4gb --verbose
      bash  && fs_sync.sh objectfile.o</font><br>
    <br>
    <br>
    Slurm config:<br>
    <br>
    <font face="monospace">Configuration data as of 2023-03-31T16:01:44<br>
      AccountingStorageBackupHost = (null)<br>
      AccountingStorageEnforce = none<br>
      AccountingStorageHost   = podarkes<br>
      AccountingStorageExternalHost = (null)<br>
      AccountingStorageParameters = (null)<br>
      AccountingStoragePort   = 6819<br>
      AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages<br>
      AccountingStorageType   = accounting_storage/slurmdbd<br>
      AccountingStorageUser   = N/A<br>
      AccountingStoreJobComment = Yes<br>
      AcctGatherEnergyType    = acct_gather_energy/none<br>
      AcctGatherFilesystemType = acct_gather_filesystem/none<br>
      AcctGatherInterconnectType = acct_gather_interconnect/none<br>
      AcctGatherNodeFreq      = 0 sec<br>
      AcctGatherProfileType   = acct_gather_profile/none<br>
      AllowSpecResourcesUsage = No<br>
      AuthAltTypes            = (null)<br>
      AuthAltParameters       = (null)<br>
      AuthInfo                = (null)<br>
      AuthType                = auth/munge<br>
      BatchStartTimeout       = 10 sec<br>
      BOOT_TIME               = 2023-02-21T10:02:56<br>
      BurstBufferType         = (null)<br>
      CliFilterPlugins        = (null)<br>
      ClusterName             = ri_cluster_v20<br>
      CommunicationParameters = (null)<br>
      CompleteWait            = 0 sec<br>
      CoreSpecPlugin          = core_spec/none<br>
      CpuFreqDef              = Unknown<br>
      CpuFreqGovernors        = Performance,OnDemand,UserSpace<br>
      CredType                = cred/munge<br>
      DebugFlags              = NO_CONF_HASH<br>
      DefMemPerNode           = UNLIMITED<br>
      DependencyParameters    = (null)<br>
      DisableRootJobs         = No<br>
      EioTimeout              = 60<br>
      EnforcePartLimits       = NO<br>
      Epilog                  = (null)<br>
      EpilogMsgTime           = 2000 usec<br>
      EpilogSlurmctld         = (null)<br>
      ExtSensorsType          = ext_sensors/none<br>
      ExtSensorsFreq          = 0 sec<br>
      FederationParameters    = (null)<br>
      FirstJobId              = 1<br>
      GetEnvTimeout           = 2 sec<br>
      GresTypes               = (null)<br>
      GpuFreqDef              = high,memory=high<br>
      GroupUpdateForce        = 1<br>
      GroupUpdateTime         = 600 sec<br>
      HASH_VAL                = Different Ours=0xf7a11381 Slurmctld=0x98e3b483<br>
      HealthCheckInterval     = 0 sec<br>
      HealthCheckNodeState    = ANY<br>
      HealthCheckProgram      = (null)<br>
      InactiveLimit           = 0 sec<br>
      InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL<br>
      JobAcctGatherFrequency  = 30<br>
      JobAcctGatherType       = jobacct_gather/linux<br>
      JobAcctGatherParams     = (null)<br>
      JobCompHost             = localhost<br>
      JobCompLoc              = /var/log/slurm_jobcomp.log<br>
      JobCompPort             = 0<br>
      JobCompType             = jobcomp/none<br>
      JobCompUser             = root<br>
      JobContainerType        = job_container/none<br>
      JobCredentialPrivateKey = (null)<br>
      JobCredentialPublicCertificate = (null)<br>
      JobDefaults             = (null)<br>
      JobFileAppend           = 0<br>
      JobRequeue              = 1<br>
      JobSubmitPlugins        = (null)<br>
      KeepAliveTime           = SYSTEM_DEFAULT<br>
      KillOnBadExit           = 0<br>
      KillWait                = 30 sec<br>
      LaunchParameters        = (null)<br>
      LaunchType              = launch/slurm<br>
      Licenses                = (null)<br>
      LogTimeFormat           = iso8601_ms<br>
      MailDomain              = (null)<br>
      MailProg                = /bin/mail<br>
      MaxArraySize            = 1001<br>
      MaxDBDMsgs              = 20112<br>
      MaxJobCount             = 10000<br>
      MaxJobId                = 67043328<br>
      MaxMemPerNode           = UNLIMITED<br>
      MaxStepCount            = 40000<br>
      MaxTasksPerNode         = 512<br>
      MCSPlugin               = mcs/none<br>
      MCSParameters           = (null)<br>
      MessageTimeout          = 60 sec<br>
      MinJobAge               = 300 sec<br>
      MpiDefault              = none<br>
      MpiParams               = (null)<br>
      NEXT_JOB_ID             = 31937596<br>
      NodeFeaturesPlugins     = (null)<br>
      OverTimeLimit           = 0 min<br>
      PluginDir               = /usr/lib64/slurm<br>
      PlugStackConfig         = (null)<br>
      PowerParameters         = (null)<br>
      PowerPlugin             = <br>
      PreemptMode             = GANG,SUSPEND<br>
      PreemptType             = preempt/partition_prio<br>
      PreemptExemptTime       = 00:02:00<br>
      PrEpParameters          = (null)<br>
      PrEpPlugins             = prep/script<br>
      PriorityParameters      = (null)<br>
      PrioritySiteFactorParameters = (null)<br>
      PrioritySiteFactorPlugin = (null)<br>
      PriorityType            = priority/basic<br>
      PrivateData             = none<br>
      ProctrackType           = proctrack/cgroup<br>
      Prolog                  = (null)<br>
      PrologEpilogTimeout     = 65534<br>
      PrologSlurmctld         = (null)<br>
      PrologFlags             = (null)<br>
      PropagatePrioProcess    = 0<br>
      PropagateResourceLimits = ALL<br>
      PropagateResourceLimitsExcept = (null)<br>
      RebootProgram           = (null)<br>
      ReconfigFlags           = (null)<br>
      RequeueExit             = (null)<br>
      RequeueExitHold         = (null)<br>
      ResumeFailProgram       = (null)<br>
      ResumeProgram           = (null)<br>
      ResumeRate              = 300 nodes/min<br>
      ResumeTimeout           = 60 sec<br>
      ResvEpilog              = (null)<br>
      ResvOverRun             = 0 min<br>
      ResvProlog              = (null)<br>
      ReturnToService         = 2<br>
      RoutePlugin             = route/default<br>
      SbcastParameters        = (null)<br>
      SchedulerParameters     = batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_resolution=600,bf_yield_interval=1000000,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=2000000<br>
      SchedulerTimeSlice      = 30 sec<br>
      SchedulerType           = sched/backfill<br>
      ScronParameters         = (null)<br>
      SelectType              = select/cons_res<br>
      SelectTypeParameters    = CR_CORE_MEMORY<br>
      SlurmUser               = slurm(471)<br>
      SlurmctldAddr           = (null)<br>
      SlurmctldDebug          = info<br>
      SlurmctldHost[0]        = clctl1<br>
      SlurmctldLogFile        = /var/log/slurm/slurmctld.log<br>
      SlurmctldPort           = 6816-6817<br>
      SlurmctldSyslogDebug    = unknown<br>
      SlurmctldPrimaryOffProg = (null)<br>
      SlurmctldPrimaryOnProg  = (null)<br>
      SlurmctldTimeout        = 120 sec<br>
      SlurmctldParameters     = (null)<br>
      SlurmdDebug             = info<br>
      SlurmdLogFile           = /var/log/slurm/slurmd.log<br>
      SlurmdParameters        = (null)<br>
      SlurmdPidFile           = /var/run/slurmd.pid<br>
      SlurmdPort              = 6818<br>
      SlurmdSpoolDir          = /var/spool/slurmd<br>
      SlurmdSyslogDebug       = unknown<br>
      SlurmdTimeout           = 300 sec<br>
      SlurmdUser              = root(0)<br>
      SlurmSchedLogFile       = (null)<br>
      SlurmSchedLogLevel      = 0<br>
      SlurmctldPidFile        = /var/run/slurmctld.pid<br>
      SlurmctldPlugstack      = (null)<br>
      SLURM_CONF              = /etc/slurm/slurm.conf<br>
      SLURM_VERSION           = 20.11.9<br>
      SrunEpilog              = (null)<br>
      SrunPortRange           = 0-0<br>
      SrunProlog              = (null)<br>
      StateSaveLocation       = /data/slurm/spool<br>
      SuspendExcNodes         = (null)<br>
      SuspendExcParts         = (null)<br>
      SuspendProgram          = (null)<br>
      SuspendRate             = 60 nodes/min<br>
      SuspendTime             = NONE<br>
      SuspendTimeout          = 30 sec<br>
      SwitchType              = switch/none<br>
      TaskEpilog              = (null)<br>
      TaskPlugin              = task/affinity,task/cgroup<br>
      TaskPluginParam         = (null type)<br>
      TaskProlog              = (null)<br>
      TCPTimeout              = 2 sec<br>
      TmpFS                   = /tmp<br>
      TopologyParam           = (null)<br>
      TopologyPlugin          = topology/none<br>
      TrackWCKey              = No<br>
      TreeWidth               = 255<br>
      UsePam                  = No<br>
      UnkillableStepProgram   = (null)<br>
      UnkillableStepTimeout   = 60 sec<br>
      VSizeFactor             = 0 percent<br>
      WaitTime                = 0 sec<br>
      X11Parameters           = (null)<br>
      <br>
      Cgroup Support Configuration:<br>
      AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf<br>
      AllowedKmemSpace        = (null)<br>
      AllowedRAMSpace         = 100.0%<br>
      AllowedSwapSpace        = 0.0%<br>
      CgroupAutomount         = yes<br>
      CgroupMountpoint        = /cgroup<br>
      ConstrainCores          = yes<br>
      ConstrainDevices        = no<br>
      ConstrainKmemSpace      = no<br>
      ConstrainRAMSpace       = yes<br>
      ConstrainSwapSpace      = yes<br>
      MaxKmemPercent          = 100.0%<br>
      MaxRAMPercent           = 100.0%<br>
      MaxSwapPercent          = 100.0%<br>
      MemorySwappiness        = (null)<br>
      MinKmemSpace            = 30 MB<br>
      MinRAMSpace             = 30 MB<br>
      TaskAffinity            = no<br>
      <br>
      Slurmctld(primary) at clctl1 is UP</font><br>
    <br>
    <br>
    Please let me know if any other information is needed to understand
    this. Any help is appreciated.<br>
    <br>
    Thanks,<br>
    -rob<br>
    <br>
  </div>

</blockquote></div>
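<div dir="ltr"><div><br></div><div>P.S. Here is the rough sketch I mentioned, to be run as root on the affected node (cl4, going by your logs). Treat it as a starting point rather than a recipe: the audit key name "sigkill_trace" is just a placeholder, and it assumes auditd is installed and the node is x86_64.<br></div><div><br></div><div><font face="monospace"># 1) Did the kernel OOM killer fire around the failure time?<br>
dmesg -T | grep -iE 'out of memory|killed process'<br>
journalctl -k --since '2023-03-27 20:50' | grep -i oom<br>
<br>
# 2) Log every SIGKILL delivered via kill(2) on this node, so the next<br>
#    failure records which process and UID sent the signal<br>
#    ("sigkill_trace" is only a placeholder key name).<br>
auditctl -a always,exit -F arch=b64 -S kill -F a1=9 -k sigkill_trace<br>
# ...then, after the next job dies:<br>
ausearch -k sigkill_trace -i | less<br>
<br>
# 3) Who has Slurm admin rights and could cancel other users' jobs?<br>
sacctmgr show user format=user,adminlevel<br>
</font></div><div><br></div><div>If ausearch shows slurmstepd as the sender, the kill most likely came from inside Slurm (a cancel or a limit being enforced); if it shows sshd, a login shell, or a cron/cleanup script, something outside Slurm is killing the step.<br></div></div>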