<div dir="ltr"><div dir="auto">Hi,</div><div dir="auto"><br></div><div>I don't think I have ever seen a sig 9 that wasn't a user. Is it possible you have folks in slurm coordinator/administrator that may be killing jobs or run running a cleanup script? Only other thing I can think of is the user is closing their remote session before the srun completes. I can't recall right now but oom might be working. dmesg -T | grep oom to see if the OS is wiping out jobs to recover memory. <br></div><div><br></div><div>Doug<br></div><div dir="auto"><div dir="auto"><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Apr 3, 2023, 8:56 AM Robert Barton <<a href="mailto:rob@realintent.com" target="_blank">rob@realintent.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
Hello,<br>
<br>
I'm looking for help understanding a problem where Slurm reports that a
job was killed, but not why. It's not clear what is actually killing the
jobs. We have seen jobs killed for time limits and out-of-memory
conditions, but those reasons are obvious in the logs when they happen,
and that is not the case here.<br>
<br>
From Googling the error messages, it seems the jobs are being killed
outside of Slurm, but the engineer insists that this is not the
case.<br>
<br>
This happens sporadically, maybe every one or two million jobs, and
is not reliably reproducible. I'm looking for any ways to gather
more information about the cause of these issues.<br>
<br>
Slurm version: 20.11.9<br>
<br>
The relevant messages:<br>
<br>
slurmctld.log:<br>
<br>
<font face="monospace">[2023-03-27T20:53:55.336] sched:
_slurm_rpc_allocate_resources JobId=31360187 NodeList=(null)
usec=5871<br>
[2023-03-27T20:54:16.753] sched: Allocate JobId=31360187
NodeList=cl4 #CPUs=1 Partition=build<br>
[2023-03-27T20:54:27.104] _job_complete: JobId=31360187 WTERMSIG 9<br>
[2023-03-27T20:54:27.104] _job_complete: JobId=31360187 done</font><br>
<br>
slurmd.log:<br>
<br>
<font face="monospace">[2023-03-27T20:54:23.978] launch task
StepId=31360187.0 request from UID:255 GID:100 HOST:10.52.49.107
PORT:59370<br>
[2023-03-27T20:54:23.979] task/affinity: lllp_distribution:
JobId=31360187 implicit auto binding: cores,one_thread, dist 1<br>
[2023-03-27T20:54:23.979] task/affinity: _lllp_generate_cpu_bind:
_lllp_generate_cpu_bind jobid [31360187]: mask_cpu,one_thread,
0x000008<br>
[2023-03-27T20:54:24.236] [31360187.0] task/cgroup:
_memcg_initialize: /slurm/uid_255/job_31360187: alloc=4096MB
mem.limit=4096MB memsw.limit=4096MB<br>
[2023-03-27T20:54:24.236] [31360187.0] task/cgroup:
_memcg_initialize: /slurm/uid_255/job_31360187/step_0:
alloc=4096MB mem.limit=4096MB memsw.limit=4096MB<br>
[2023-03-27T20:54:27.038] [31360187.0] error: *** STEP 31360187.0
ON cl4 CANCELLED AT 2023-03-27T20:54:27 ***<br>
[2023-03-27T20:54:27.099] [31360187.0] done with job</font><br>
<br>
srun output:<br>
<br>
<font face="monospace">srun: job 31360187 queued and waiting for
resources<br>
srun: job 31360187 has been allocated resources<br>
srun: jobid 31360187: nodes(1):`cl4', cpu counts: 1(x1)<br>
srun: launching StepId=31360187.0 on host cl4, 1 tasks: 0<br>
srun: launch/slurm: launch_p_step_launch: StepId=31360187.0
aborted before step completely launched.<br>
srun: Complete StepId=31360187.0+0 received<br>
slurmstepd: error: *** STEP 31360187.0 ON cl4 CANCELLED AT
2023-03-27T20:54:27 ***<br>
srun: launch/slurm: _task_finish: Received task exit notification
for 1 task of StepId=31360187.0 (status=0x0009).</font><br>
<br>
accounting:<br>
<br>
<font face="monospace"># sacct -o jobid,elapsed,reason,state,exit -j
31360187<br>
      JobID   Elapsed                Reason     State ExitCode
<br>
------------ ---------- ---------------------- ---------- --------
<br>
31360187      00:00:11                  None    FAILED     0:9
<br>
</font><br>
<br>
These are compile jobs run via srun. The srun command is of this
form (I've omitted the -I and -D parts, which are irrelevant and
contain private information):<br>
<br>
<font face="Courier New">( echo -n 'max=3126 ; printf "[%2d%%
%${#max}d/3126] %s\n" `expr 2090 \* 100 / 3126` 2090 "["c+11.2"]
$(printf "[slurm %4s %s]" $(uname -n) $SLURM_JOB_ID) objectfile.o"
; fs_sync.sh sourcefile.cpp Makefile.flags ; ' ; printf '%q ' g++
-MT objectfile.o -MMD -MP -MF optionfile.Td -m64 -Werror -W -Wall
-Wno-parentheses -Wno-unused-parameter -Wno-uninitialized
-Wno-maybe-uninitialized  -Wno-misleading-indentation
-Wno-implicit-fallthrough -std=c++20 -g -g2 ) | srun  -J rgrmake
-p build -N 1 -n 1 -c 1 --quit-on-interrupt --mem=4gb --verbose
bash  && fs_sync.sh objectfile.o</font><br>
<br>
<br>
Slurm config:<br>
<br>
<font face="monospace">Configuration data as of 2023-03-31T16:01:44<br>
AccountingStorageBackupHost = (null)<br>
AccountingStorageEnforce = none<br>
AccountingStorageHost  = podarkes<br>
AccountingStorageExternalHost = (null)<br>
AccountingStorageParameters = (null)<br>
AccountingStoragePort  = 6819<br>
AccountingStorageTRES  =
cpu,mem,energy,node,billing,fs/disk,vmem,pages<br>
AccountingStorageType  = accounting_storage/slurmdbd<br>
AccountingStorageUser  = N/A<br>
AccountingStoreJobComment = Yes<br>
AcctGatherEnergyType   = acct_gather_energy/none<br>
AcctGatherFilesystemType = acct_gather_filesystem/none<br>
AcctGatherInterconnectType = acct_gather_interconnect/none<br>
AcctGatherNodeFreq     = 0 sec<br>
AcctGatherProfileType  = acct_gather_profile/none<br>
AllowSpecResourcesUsage = No<br>
AuthAltTypes           = (null)<br>
AuthAltParameters      = (null)<br>
AuthInfo               = (null)<br>
AuthType               = auth/munge<br>
BatchStartTimeout      = 10 sec<br>
BOOT_TIME              = 2023-02-21T10:02:56<br>
BurstBufferType        = (null)<br>
CliFilterPlugins       = (null)<br>
ClusterName            = ri_cluster_v20<br>
CommunicationParameters = (null)<br>
CompleteWait           = 0 sec<br>
CoreSpecPlugin         = core_spec/none<br>
CpuFreqDef             = Unknown<br>
CpuFreqGovernors       = Performance,OnDemand,UserSpace<br>
CredType               = cred/munge<br>
DebugFlags             = NO_CONF_HASH<br>
DefMemPerNode          = UNLIMITED<br>
DependencyParameters   = (null)<br>
DisableRootJobs        = No<br>
EioTimeout             = 60<br>
EnforcePartLimits      = NO<br>
Epilog                 = (null)<br>
EpilogMsgTime          = 2000 usec<br>
EpilogSlurmctld        = (null)<br>
ExtSensorsType         = ext_sensors/none<br>
ExtSensorsFreq         = 0 sec<br>
FederationParameters   = (null)<br>
FirstJobId             = 1<br>
GetEnvTimeout          = 2 sec<br>
GresTypes              = (null)<br>
GpuFreqDef             = high,memory=high<br>
GroupUpdateForce       = 1<br>
GroupUpdateTime        = 600 sec<br>
HASH_VAL               = Different Ours=0xf7a11381
Slurmctld=0x98e3b483<br>
HealthCheckInterval    = 0 sec<br>
HealthCheckNodeState   = ANY<br>
HealthCheckProgram     = (null)<br>
InactiveLimit          = 0 sec<br>
InteractiveStepOptions = --interactive --preserve-env --pty
$SHELL<br>
JobAcctGatherFrequency = 30<br>
JobAcctGatherType      = jobacct_gather/linux<br>
JobAcctGatherParams    = (null)<br>
JobCompHost            = localhost<br>
JobCompLoc             = /var/log/slurm_jobcomp.log<br>
JobCompPort            = 0<br>
JobCompType            = jobcomp/none<br>
JobCompUser            = root<br>
JobContainerType       = job_container/none<br>
JobCredentialPrivateKey = (null)<br>
JobCredentialPublicCertificate = (null)<br>
JobDefaults            = (null)<br>
JobFileAppend          = 0<br>
JobRequeue             = 1<br>
JobSubmitPlugins       = (null)<br>
KeepAliveTime          = SYSTEM_DEFAULT<br>
KillOnBadExit          = 0<br>
KillWait               = 30 sec<br>
LaunchParameters       = (null)<br>
LaunchType             = launch/slurm<br>
Licenses               = (null)<br>
LogTimeFormat          = iso8601_ms<br>
MailDomain             = (null)<br>
MailProg               = /bin/mail<br>
MaxArraySize           = 1001<br>
MaxDBDMsgs             = 20112<br>
MaxJobCount            = 10000<br>
MaxJobId               = 67043328<br>
MaxMemPerNode          = UNLIMITED<br>
MaxStepCount           = 40000<br>
MaxTasksPerNode        = 512<br>
MCSPlugin              = mcs/none<br>
MCSParameters          = (null)<br>
MessageTimeout         = 60 sec<br>
MinJobAge              = 300 sec<br>
MpiDefault             = none<br>
MpiParams              = (null)<br>
NEXT_JOB_ID            = 31937596<br>
NodeFeaturesPlugins    = (null)<br>
OverTimeLimit          = 0 min<br>
PluginDir              = /usr/lib64/slurm<br>
PlugStackConfig        = (null)<br>
PowerParameters        = (null)<br>
PowerPlugin            = <br>
PreemptMode            = GANG,SUSPEND<br>
PreemptType            = preempt/partition_prio<br>
PreemptExemptTime      = 00:02:00<br>
PrEpParameters         = (null)<br>
PrEpPlugins            = prep/script<br>
PriorityParameters     = (null)<br>
PrioritySiteFactorParameters = (null)<br>
PrioritySiteFactorPlugin = (null)<br>
PriorityType           = priority/basic<br>
PrivateData            = none<br>
ProctrackType          = proctrack/cgroup<br>
Prolog                 = (null)<br>
PrologEpilogTimeout    = 65534<br>
PrologSlurmctld        = (null)<br>
PrologFlags            = (null)<br>
PropagatePrioProcess   = 0<br>
PropagateResourceLimits = ALL<br>
PropagateResourceLimitsExcept = (null)<br>
RebootProgram          = (null)<br>
ReconfigFlags          = (null)<br>
RequeueExit            = (null)<br>
RequeueExitHold        = (null)<br>
ResumeFailProgram      = (null)<br>
ResumeProgram          = (null)<br>
ResumeRate             = 300 nodes/min<br>
ResumeTimeout          = 60 sec<br>
ResvEpilog             = (null)<br>
ResvOverRun            = 0 min<br>
ResvProlog             = (null)<br>
ReturnToService        = 2<br>
RoutePlugin            = route/default<br>
SbcastParameters       = (null)<br>
SchedulerParameters    =
batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_resolution=600,bf_yield_interval=1000000,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=2000000<br>
SchedulerTimeSlice     = 30 sec<br>
SchedulerType          = sched/backfill<br>
ScronParameters        = (null)<br>
SelectType             = select/cons_res<br>
SelectTypeParameters   = CR_CORE_MEMORY<br>
SlurmUser              = slurm(471)<br>
SlurmctldAddr          = (null)<br>
SlurmctldDebug         = info<br>
SlurmctldHost[0]       = clctl1<br>
SlurmctldLogFile       = /var/log/slurm/slurmctld.log<br>
SlurmctldPort          = 6816-6817<br>
SlurmctldSyslogDebug   = unknown<br>
SlurmctldPrimaryOffProg = (null)<br>
SlurmctldPrimaryOnProg = (null)<br>
SlurmctldTimeout       = 120 sec<br>
SlurmctldParameters    = (null)<br>
SlurmdDebug            = info<br>
SlurmdLogFile          = /var/log/slurm/slurmd.log<br>
SlurmdParameters       = (null)<br>
SlurmdPidFile          = /var/run/slurmd.pid<br>
SlurmdPort             = 6818<br>
SlurmdSpoolDir         = /var/spool/slurmd<br>
SlurmdSyslogDebug      = unknown<br>
SlurmdTimeout          = 300 sec<br>
SlurmdUser             = root(0)<br>
SlurmSchedLogFile      = (null)<br>
SlurmSchedLogLevel     = 0<br>
SlurmctldPidFile       = /var/run/slurmctld.pid<br>
SlurmctldPlugstack     = (null)<br>
SLURM_CONF             = /etc/slurm/slurm.conf<br>
SLURM_VERSION          = 20.11.9<br>
SrunEpilog             = (null)<br>
SrunPortRange          = 0-0<br>
SrunProlog             = (null)<br>
StateSaveLocation      = /data/slurm/spool<br>
SuspendExcNodes        = (null)<br>
SuspendExcParts        = (null)<br>
SuspendProgram         = (null)<br>
SuspendRate            = 60 nodes/min<br>
SuspendTime            = NONE<br>
SuspendTimeout         = 30 sec<br>
SwitchType             = switch/none<br>
TaskEpilog             = (null)<br>
TaskPlugin             = task/affinity,task/cgroup<br>
TaskPluginParam        = (null type)<br>
TaskProlog             = (null)<br>
TCPTimeout             = 2 sec<br>
TmpFS                  = /tmp<br>
TopologyParam          = (null)<br>
TopologyPlugin         = topology/none<br>
TrackWCKey             = No<br>
TreeWidth              = 255<br>
UsePam                 = No<br>
UnkillableStepProgram  = (null)<br>
UnkillableStepTimeout  = 60 sec<br>
VSizeFactor            = 0 percent<br>
WaitTime               = 0 sec<br>
X11Parameters          = (null)<br>
<br>
Cgroup Support Configuration:<br>
AllowedDevicesFile     =
/etc/slurm/cgroup_allowed_devices_file.conf<br>
AllowedKmemSpace       = (null)<br>
AllowedRAMSpace        = 100.0%<br>
AllowedSwapSpace       = 0.0%<br>
CgroupAutomount        = yes<br>
CgroupMountpoint       = /cgroup<br>
ConstrainCores         = yes<br>
ConstrainDevices       = no<br>
ConstrainKmemSpace     = no<br>
ConstrainRAMSpace      = yes<br>
ConstrainSwapSpace     = yes<br>
MaxKmemPercent         = 100.0%<br>
MaxRAMPercent          = 100.0%<br>
MaxSwapPercent         = 100.0%<br>
MemorySwappiness       = (null)<br>
MinKmemSpace           = 30 MB<br>
MinRAMSpace            = 30 MB<br>
TaskAffinity           = no<br>
<br>
Slurmctld(primary) at clctl1 is UP</font><br>
<br>
<br>
Please let me know if any other information is needed to understand
this. Any help is appreciated.<br>
<br>
Thanks,<br>
-rob<br>
<br>
</div>
</blockquote></div>
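<div dir="ltr"><br><div>P.S. A rough sketch of the checks I mean, in case it helps. I haven't run these against your site; the sacctmgr "withcoordinators" option and the exact wording slurmctld logs for a kill request are from memory, so treat this as a starting point rather than a recipe:<br></div><div><font face="monospace"># 1) Was the kernel OOM killer involved on the node that ran the job (cl4)?<br>ssh cl4 'dmesg -T | grep -iE "oom|killed process"'<br><br># 2) Does anyone hold Slurm admin or coordinator rights that would let them<br>#    scancel other users' jobs?<br>sacctmgr show user format=User,AdminLevel<br>sacctmgr show account withcoordinators format=Account,Coordinators<br><br># 3) Did slurmctld record an explicit kill/cancel request for this job?<br>grep 31360187 /var/log/slurm/slurmctld.log | grep -iE "kill|cancel"<br></font></div><div><br></div><div>If the dmesg grep shows the kernel reclaiming memory around 20:54:27, that would explain a signal 9 that Slurm itself never attributes to OOM.<br></div></div>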