<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
Hello,<br>
<br>
I'm looking for help understanding a problem where Slurm reports
that a job was killed, but not why. It's not clear what is actually
killing the jobs. We've seen jobs killed for exceeding time limits
and for running out of memory, and in those cases the reason is
obvious in the logs; that is not what's happening here.<br>
<br>
From Googling the error messages, it seems the jobs are being
killed by something outside of Slurm, but the engineer insists that
this is not the case.<br>
<br>
This happens sporadically, roughly once every one or two million
jobs, and is not reliably reproducible. I'm looking for any way to
gather more information about the cause.<br>
<br>
Slurm version: 20.11.9<br>
<br>
The relevant messages:<br>
<br>
slurmctld.log:<br>
<br>
<font face="monospace">[2023-03-27T20:53:55.336] sched:
_slurm_rpc_allocate_resources JobId=31360187 NodeList=(null)
usec=5871<br>
[2023-03-27T20:54:16.753] sched: Allocate JobId=31360187
NodeList=cl4 #CPUs=1 Partition=build<br>
[2023-03-27T20:54:27.104] _job_complete: JobId=31360187 WTERMSIG 9<br>
[2023-03-27T20:54:27.104] _job_complete: JobId=31360187 done</font><br>
<br>
slurmd.log:<br>
<br>
<font face="monospace">[2023-03-27T20:54:23.978] launch task
StepId=31360187.0 request from UID:255 GID:100 HOST:10.52.49.107
PORT:59370<br>
[2023-03-27T20:54:23.979] task/affinity: lllp_distribution:
JobId=31360187 implicit auto binding: cores,one_thread, dist 1<br>
[2023-03-27T20:54:23.979] task/affinity: _lllp_generate_cpu_bind:
_lllp_generate_cpu_bind jobid [31360187]: mask_cpu,one_thread,
0x000008<br>
[2023-03-27T20:54:24.236] [31360187.0] task/cgroup:
_memcg_initialize: /slurm/uid_255/job_31360187: alloc=4096MB
mem.limit=4096MB memsw.limit=4096MB<br>
[2023-03-27T20:54:24.236] [31360187.0] task/cgroup:
_memcg_initialize: /slurm/uid_255/job_31360187/step_0:
alloc=4096MB mem.limit=4096MB memsw.limit=4096MB<br>
[2023-03-27T20:54:27.038] [31360187.0] error: *** STEP 31360187.0
ON cl4 CANCELLED AT 2023-03-27T20:54:27 ***<br>
[2023-03-27T20:54:27.099] [31360187.0] done with job</font><br>
<br>
srun output:<br>
<br>
<font face="monospace">srun: job 31360187 queued and waiting for
resources<br>
srun: job 31360187 has been allocated resources<br>
srun: jobid 31360187: nodes(1):`cl4', cpu counts: 1(x1)<br>
srun: launching StepId=31360187.0 on host cl4, 1 tasks: 0<br>
srun: launch/slurm: launch_p_step_launch: StepId=31360187.0
aborted before step completely launched.<br>
srun: Complete StepId=31360187.0+0 received<br>
slurmstepd: error: *** STEP 31360187.0 ON cl4 CANCELLED AT
2023-03-27T20:54:27 ***<br>
srun: launch/slurm: _task_finish: Received task exit notification
for 1 task of StepId=31360187.0 (status=0x0009).</font><br>
<br>
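For reference, the status=0x0009 in the last srun line appears to be
a wait(2)-style status, which would mean the task was terminated by
signal 9 (SIGKILL) and never returned an exit code of its own. The
sketch below decodes such a value, assuming the usual Linux encoding
(low 7 bits are the signal, the next 8 bits are the exit code);
decode_status is a hypothetical helper name, not a Slurm command.<br>
<pre>
```shell
# Decode a wait(2)-style status such as the 0x0009 reported by srun.
# Assumption: usual Linux encoding, low 7 bits = terminating signal,
# bits 8-15 = exit code. The core-dump flag (bit 7) is ignored here.
decode_status() {
  status=$(( $1 ))                  # accepts decimal or 0x-prefixed hex
  sig=$(( status % 128 ))           # low 7 bits: signal number
  code=$(( (status / 256) % 256 ))  # next byte up: exit code
  echo "exit=${code} signal=${sig}"
}

decode_status 0x0009   # prints: exit=0 signal=9
```
</pre>
This agrees with the WTERMSIG 9 line in slurmctld.log, but it only
says which signal arrived, not who sent it.<br>
<br>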
accounting:<br>
<br>
<font face="monospace"># sacct -o jobid,elapsed,reason,state,exit -j
31360187<br>
JobID Elapsed Reason State ExitCode
<br>
------------ ---------- ---------------------- ---------- --------
<br>
31360187 00:00:11 None FAILED 0:9
<br>
</font><br>
<br>
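Since sacct documents ExitCode as the exit code followed by the
signal that terminated the process, 0:9 here again points at signal
9. One possible way to collect other affected jobs for comparison is
to filter parsable sacct output on that field. This is only a sketch:
filter_sigkill is a hypothetical helper, the column order is an
assumption that must match the -o list, and the sample rows stand in
for real sacct output.<br>
<pre>
```shell
# Hypothetical helper: print the JobID of every row whose ExitCode
# ends in ":9", i.e. the process was terminated by signal 9.
# Expects pipe-delimited "JobID|State|ExitCode" rows on stdin.
filter_sigkill() {
  awk -F'|' '$3 ~ /:9$/ { print $1 }'
}

# Sample rows for illustration; on the cluster the input could come
# from something like:
#   sacct -n -P -X -S 2023-03-27 -o jobid,state,exitcode
printf '%s\n' '31360187|FAILED|0:9' '31360188|COMPLETED|0:0' | filter_sigkill
```
</pre>
With the sample rows above, only 31360187 is printed.<br>
<br>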
These are compile jobs run via srun. The srun command has this
form (I've omitted the -I and -D options, which are irrelevant and
contain private information):<br>
<br>
<font face="Courier New">( echo -n 'max=3126 ; printf "[%2d%%
%${#max}d/3126] %s\n" `expr 2090 \* 100 / 3126` 2090 "["c+11.2"]
$(printf "[slurm %4s %s]" $(uname -n) $SLURM_JOB_ID) objectfile.o"
; fs_sync.sh sourcefile.cpp Makefile.flags ; ' ; printf '%q ' g++
-MT objectfile.o -MMD -MP -MF optionfile.Td -m64 -Werror -W -Wall
-Wno-parentheses -Wno-unused-parameter -Wno-uninitialized
-Wno-maybe-uninitialized -Wno-misleading-indentation
-Wno-implicit-fallthrough -std=c++20 -g -g2 ) | srun -J rgrmake
-p build -N 1 -n 1 -c 1 --quit-on-interrupt --mem=4gb --verbose
bash && fs_sync.sh objectfile.o</font><br>
<br>
<br>
Slurm config:<br>
<br>
<font face="monospace">Configuration data as of 2023-03-31T16:01:44<br>
AccountingStorageBackupHost = (null)<br>
AccountingStorageEnforce = none<br>
AccountingStorageHost = podarkes<br>
AccountingStorageExternalHost = (null)<br>
AccountingStorageParameters = (null)<br>
AccountingStoragePort = 6819<br>
AccountingStorageTRES =
cpu,mem,energy,node,billing,fs/disk,vmem,pages<br>
AccountingStorageType = accounting_storage/slurmdbd<br>
AccountingStorageUser = N/A<br>
AccountingStoreJobComment = Yes<br>
AcctGatherEnergyType = acct_gather_energy/none<br>
AcctGatherFilesystemType = acct_gather_filesystem/none<br>
AcctGatherInterconnectType = acct_gather_interconnect/none<br>
AcctGatherNodeFreq = 0 sec<br>
AcctGatherProfileType = acct_gather_profile/none<br>
AllowSpecResourcesUsage = No<br>
AuthAltTypes = (null)<br>
AuthAltParameters = (null)<br>
AuthInfo = (null)<br>
AuthType = auth/munge<br>
BatchStartTimeout = 10 sec<br>
BOOT_TIME = 2023-02-21T10:02:56<br>
BurstBufferType = (null)<br>
CliFilterPlugins = (null)<br>
ClusterName = ri_cluster_v20<br>
CommunicationParameters = (null)<br>
CompleteWait = 0 sec<br>
CoreSpecPlugin = core_spec/none<br>
CpuFreqDef = Unknown<br>
CpuFreqGovernors = Performance,OnDemand,UserSpace<br>
CredType = cred/munge<br>
DebugFlags = NO_CONF_HASH<br>
DefMemPerNode = UNLIMITED<br>
DependencyParameters = (null)<br>
DisableRootJobs = No<br>
EioTimeout = 60<br>
EnforcePartLimits = NO<br>
Epilog = (null)<br>
EpilogMsgTime = 2000 usec<br>
EpilogSlurmctld = (null)<br>
ExtSensorsType = ext_sensors/none<br>
ExtSensorsFreq = 0 sec<br>
FederationParameters = (null)<br>
FirstJobId = 1<br>
GetEnvTimeout = 2 sec<br>
GresTypes = (null)<br>
GpuFreqDef = high,memory=high<br>
GroupUpdateForce = 1<br>
GroupUpdateTime = 600 sec<br>
HASH_VAL = Different Ours=0xf7a11381
Slurmctld=0x98e3b483<br>
HealthCheckInterval = 0 sec<br>
HealthCheckNodeState = ANY<br>
HealthCheckProgram = (null)<br>
InactiveLimit = 0 sec<br>
InteractiveStepOptions = --interactive --preserve-env --pty
$SHELL<br>
JobAcctGatherFrequency = 30<br>
JobAcctGatherType = jobacct_gather/linux<br>
JobAcctGatherParams = (null)<br>
JobCompHost = localhost<br>
JobCompLoc = /var/log/slurm_jobcomp.log<br>
JobCompPort = 0<br>
JobCompType = jobcomp/none<br>
JobCompUser = root<br>
JobContainerType = job_container/none<br>
JobCredentialPrivateKey = (null)<br>
JobCredentialPublicCertificate = (null)<br>
JobDefaults = (null)<br>
JobFileAppend = 0<br>
JobRequeue = 1<br>
JobSubmitPlugins = (null)<br>
KeepAliveTime = SYSTEM_DEFAULT<br>
KillOnBadExit = 0<br>
KillWait = 30 sec<br>
LaunchParameters = (null)<br>
LaunchType = launch/slurm<br>
Licenses = (null)<br>
LogTimeFormat = iso8601_ms<br>
MailDomain = (null)<br>
MailProg = /bin/mail<br>
MaxArraySize = 1001<br>
MaxDBDMsgs = 20112<br>
MaxJobCount = 10000<br>
MaxJobId = 67043328<br>
MaxMemPerNode = UNLIMITED<br>
MaxStepCount = 40000<br>
MaxTasksPerNode = 512<br>
MCSPlugin = mcs/none<br>
MCSParameters = (null)<br>
MessageTimeout = 60 sec<br>
MinJobAge = 300 sec<br>
MpiDefault = none<br>
MpiParams = (null)<br>
NEXT_JOB_ID = 31937596<br>
NodeFeaturesPlugins = (null)<br>
OverTimeLimit = 0 min<br>
PluginDir = /usr/lib64/slurm<br>
PlugStackConfig = (null)<br>
PowerParameters = (null)<br>
PowerPlugin = <br>
PreemptMode = GANG,SUSPEND<br>
PreemptType = preempt/partition_prio<br>
PreemptExemptTime = 00:02:00<br>
PrEpParameters = (null)<br>
PrEpPlugins = prep/script<br>
PriorityParameters = (null)<br>
PrioritySiteFactorParameters = (null)<br>
PrioritySiteFactorPlugin = (null)<br>
PriorityType = priority/basic<br>
PrivateData = none<br>
ProctrackType = proctrack/cgroup<br>
Prolog = (null)<br>
PrologEpilogTimeout = 65534<br>
PrologSlurmctld = (null)<br>
PrologFlags = (null)<br>
PropagatePrioProcess = 0<br>
PropagateResourceLimits = ALL<br>
PropagateResourceLimitsExcept = (null)<br>
RebootProgram = (null)<br>
ReconfigFlags = (null)<br>
RequeueExit = (null)<br>
RequeueExitHold = (null)<br>
ResumeFailProgram = (null)<br>
ResumeProgram = (null)<br>
ResumeRate = 300 nodes/min<br>
ResumeTimeout = 60 sec<br>
ResvEpilog = (null)<br>
ResvOverRun = 0 min<br>
ResvProlog = (null)<br>
ReturnToService = 2<br>
RoutePlugin = route/default<br>
SbcastParameters = (null)<br>
SchedulerParameters =
batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_resolution=600,bf_yield_interval=1000000,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=2000000<br>
SchedulerTimeSlice = 30 sec<br>
SchedulerType = sched/backfill<br>
ScronParameters = (null)<br>
SelectType = select/cons_res<br>
SelectTypeParameters = CR_CORE_MEMORY<br>
SlurmUser = slurm(471)<br>
SlurmctldAddr = (null)<br>
SlurmctldDebug = info<br>
SlurmctldHost[0] = clctl1<br>
SlurmctldLogFile = /var/log/slurm/slurmctld.log<br>
SlurmctldPort = 6816-6817<br>
SlurmctldSyslogDebug = unknown<br>
SlurmctldPrimaryOffProg = (null)<br>
SlurmctldPrimaryOnProg = (null)<br>
SlurmctldTimeout = 120 sec<br>
SlurmctldParameters = (null)<br>
SlurmdDebug = info<br>
SlurmdLogFile = /var/log/slurm/slurmd.log<br>
SlurmdParameters = (null)<br>
SlurmdPidFile = /var/run/slurmd.pid<br>
SlurmdPort = 6818<br>
SlurmdSpoolDir = /var/spool/slurmd<br>
SlurmdSyslogDebug = unknown<br>
SlurmdTimeout = 300 sec<br>
SlurmdUser = root(0)<br>
SlurmSchedLogFile = (null)<br>
SlurmSchedLogLevel = 0<br>
SlurmctldPidFile = /var/run/slurmctld.pid<br>
SlurmctldPlugstack = (null)<br>
SLURM_CONF = /etc/slurm/slurm.conf<br>
SLURM_VERSION = 20.11.9<br>
SrunEpilog = (null)<br>
SrunPortRange = 0-0<br>
SrunProlog = (null)<br>
StateSaveLocation = /data/slurm/spool<br>
SuspendExcNodes = (null)<br>
SuspendExcParts = (null)<br>
SuspendProgram = (null)<br>
SuspendRate = 60 nodes/min<br>
SuspendTime = NONE<br>
SuspendTimeout = 30 sec<br>
SwitchType = switch/none<br>
TaskEpilog = (null)<br>
TaskPlugin = task/affinity,task/cgroup<br>
TaskPluginParam = (null type)<br>
TaskProlog = (null)<br>
TCPTimeout = 2 sec<br>
TmpFS = /tmp<br>
TopologyParam = (null)<br>
TopologyPlugin = topology/none<br>
TrackWCKey = No<br>
TreeWidth = 255<br>
UsePam = No<br>
UnkillableStepProgram = (null)<br>
UnkillableStepTimeout = 60 sec<br>
VSizeFactor = 0 percent<br>
WaitTime = 0 sec<br>
X11Parameters = (null)<br>
<br>
Cgroup Support Configuration:<br>
AllowedDevicesFile =
/etc/slurm/cgroup_allowed_devices_file.conf<br>
AllowedKmemSpace = (null)<br>
AllowedRAMSpace = 100.0%<br>
AllowedSwapSpace = 0.0%<br>
CgroupAutomount = yes<br>
CgroupMountpoint = /cgroup<br>
ConstrainCores = yes<br>
ConstrainDevices = no<br>
ConstrainKmemSpace = no<br>
ConstrainRAMSpace = yes<br>
ConstrainSwapSpace = yes<br>
MaxKmemPercent = 100.0%<br>
MaxRAMPercent = 100.0%<br>
MaxSwapPercent = 100.0%<br>
MemorySwappiness = (null)<br>
MinKmemSpace = 30 MB<br>
MinRAMSpace = 30 MB<br>
TaskAffinity = no<br>
<br>
Slurmctld(primary) at clctl1 is UP</font><br>
<br>
<br>
Please let me know if any other information is needed to understand
this. Any help is appreciated.<br>
<br>
Thanks,<br>
-rob<br>
<br>
</body>
</html>