Robert Barton rob at realintent.com
Mon Apr 3 15:52:49 UTC 2023


I'm looking for help understanding a problem where Slurm reports that a 
job was killed, but not why. It's not clear what is actually killing the 
jobs. We have seen jobs killed for exceeding time limits and for running 
out of memory, but in those cases the reason is obvious in the logs, and 
that is not what we see here.

From searching on the error messages, it looks as though the jobs are 
being killed by something outside of Slurm, but the engineer insists that 
this is not the case.

This happens sporadically, roughly once every one or two million jobs, 
and is not reliably reproducible. I'm looking for any way to gather more 
information about the cause.

Slurm version: 20.11.9

The relevant messages:

slurmctld log:


[2023-03-27T20:53:55.336] sched: _slurm_rpc_allocate_resources 
JobId=31360187 NodeList=(null) usec=5871
[2023-03-27T20:54:16.753] sched: Allocate JobId=31360187 NodeList=cl4 
#CPUs=1 Partition=build
[2023-03-27T20:54:27.104] _job_complete: JobId=31360187 WTERMSIG 9
[2023-03-27T20:54:27.104] _job_complete: JobId=31360187 done

slurmd log on cl4:


[2023-03-27T20:54:23.978] launch task StepId=31360187.0 request from 
UID:255 GID:100 HOST: PORT:59370
[2023-03-27T20:54:23.979] task/affinity: lllp_distribution: 
JobId=31360187 implicit auto binding: cores,one_thread, dist 1
[2023-03-27T20:54:23.979] task/affinity: _lllp_generate_cpu_bind: 
_lllp_generate_cpu_bind jobid [31360187]: mask_cpu,one_thread, 0x000008
[2023-03-27T20:54:24.236] [31360187.0] task/cgroup: _memcg_initialize: 
/slurm/uid_255/job_31360187: alloc=4096MB mem.limit=4096MB 
[2023-03-27T20:54:24.236] [31360187.0] task/cgroup: _memcg_initialize: 
/slurm/uid_255/job_31360187/step_0: alloc=4096MB mem.limit=4096MB 
[2023-03-27T20:54:27.038] [31360187.0] error: *** STEP 31360187.0 ON cl4 
CANCELLED AT 2023-03-27T20:54:27 ***
[2023-03-27T20:54:27.099] [31360187.0] done with job
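
To read the two log excerpts above as a single timeline, something like 
the following sketch can help. It assumes the logs use the iso8601_ms 
timestamp prefix shown here (so a plain text sort is chronological) and 
that the clocks on the controller and the node are in sync. The function 
name job_timeline is made up for illustration, and the paths in the usage 
comment are the ones from our config.

```shell
# Print every log line that mentions a job ID, merged across files and
# ordered by the leading [timestamp] field.
job_timeline() {
    jobid=$1
    shift
    # -h drops file-name prefixes so the timestamp is the first field;
    # -F treats the job ID as a fixed string, not a regex.
    # Note: this also matches longer IDs that contain the same digits.
    grep -hF -- "$jobid" "$@" | LC_ALL=C sort -s -k1,1
}

# Usage (after copying slurmd.log from the node to a local file):
#   job_timeline 31360187 /var/log/slurm/slurmctld.log ./cl4-slurmd.log
```

In the excerpts here, this ordering puts the step launch at 20:54:23.978, 
the CANCELLED message at 20:54:27.038, and _job_complete at 20:54:27.104.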

srun output:

srun: job 31360187 queued and waiting for resources
srun: job 31360187 has been allocated resources
srun: jobid 31360187: nodes(1):`cl4', cpu counts: 1(x1)
srun: launching StepId=31360187.0 on host cl4, 1 tasks: 0
srun: launch/slurm: launch_p_step_launch: StepId=31360187.0 aborted 
before step completely launched.
srun: Complete StepId=31360187.0+0 received
slurmstepd: error: *** STEP 31360187.0 ON cl4 CANCELLED AT 
2023-03-27T20:54:27 ***
srun: launch/slurm: _task_finish: Received task exit notification for 1 
task of StepId=31360187.0 (status=0x0009).


# sacct -o jobid,elapsed,reason,state,exit -j 31360187
        JobID    Elapsed                 Reason      State ExitCode
------------ ---------- ---------------------- ---------- --------
31360187       00:00:11                   None     FAILED      0:9
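
For reference, the "WTERMSIG 9" in the slurmctld log, the 
"status=0x0009" from srun, and the "0:9" ExitCode from sacct (exit code, 
then signal) all appear to describe the same thing: the task was 
terminated by signal 9, SIGKILL. A minimal sketch of decoding the raw 
status, assuming the usual Linux wait-status layout; decode_wait_status 
is a hypothetical helper, not a Slurm tool:

```shell
# Decode a wait status such as the 0x0009 reported by srun.
# Assumed layout: low 7 bits = terminating signal, bit 7 = core-dump flag,
# next 8 bits = exit code. Stopped/continued states are not handled.
decode_wait_status() {
    status=$(( $1 ))
    sig=$(( status & 127 ))
    core=$(( (status >> 7) & 1 ))
    code=$(( (status >> 8) & 255 ))
    if [ "$sig" -eq 0 ]; then
        echo "exited code=$code"
    else
        echo "signal=$sig name=$(kill -l "$sig") core=$core"
    fi
}

decode_wait_status 0x0009   # prints: signal=9 name=KILL core=0
```

This only confirms what signal ended the task; it says nothing about 
which process sent it.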

These are compile jobs run via srun. The srun command is of this form 
(I've omitted the -I and -D parts as irrelevant and containing private 
information):

( echo -n 'max=3126 ; printf "[%2d%% %${#max}d/3126] %s\n" `expr 2090 \* 
100 / 3126` 2090 "["c+11.2"] $(printf "[slurm %4s %s]" $(uname -n) 
$SLURM_JOB_ID) objectfile.o" ; fs_sync.sh sourcefile.cpp Makefile.flags 
; ' ; printf '%q ' g++ -MT objectfile.o -MMD -MP -MF optionfile.Td -m64 
-Werror -W -Wall -Wno-parentheses -Wno-unused-parameter 
-Wno-uninitialized -Wno-maybe-uninitialized  -Wno-misleading-indentation 
-Wno-implicit-fallthrough -std=c++20 -g -g2 ) | srun  -J rgrmake -p 
build -N 1 -n 1 -c 1 --quit-on-interrupt --mem=4gb --verbose bash  && 
fs_sync.sh objectfile.o
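
To make the structure of that command easier to see, here is a 
stripped-down sketch of the same pattern that runs without a cluster: a 
subshell writes a small script to stdout (a fixed prefix from echo -n, 
then the real command with each argument quoted by printf '%q'), and the 
script is piped into a bash that reads it from stdin. In this sketch a 
local "bash" stands in for "srun ... bash", and run_quoted and its 
arguments are invented for illustration. It requires bash, since %q is 
not supported by every printf.

```shell
#!/usr/bin/env bash
# Build a one-line script and feed it to bash on stdin, as the real
# command does through srun.
run_quoted() {
    # Part 1: fixed prefix in single quotes, so any $variables in it would
    #         expand where the script runs (the compute node, with srun).
    # Part 2: the command itself; %q quotes each argument so spaces and
    #         $ characters survive being re-parsed by the receiving bash.
    ( echo -n 'echo "prefix ran" ; ' ; printf '%q ' "$@" ) | bash
}

run_quoted printf '<%s>\n' 'two words' '$HOME'
```

The receiving bash prints "prefix ran", then <two words> and <$HOME> 
unchanged, which shows the quoted arguments arriving intact.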

Slurm config:

Configuration data as of 2023-03-31T16:01:44
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = podarkes
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes            = (null)
AuthAltParameters       = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2023-02-21T10:02:56
BurstBufferType         = (null)
CliFilterPlugins        = (null)
ClusterName             = ri_cluster_v20
CommunicationParameters = (null)
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand,UserSpace
CredType                = cred/munge
DebugFlags              = NO_CONF_HASH
DefMemPerNode           = UNLIMITED
DependencyParameters    = (null)
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = NO
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = (null)
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Different Ours=0xf7a11381 Slurmctld=0x98e3b483
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = (null)
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Licenses                = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 1001
MaxDBDMsgs              = 20112
MaxJobCount             = 10000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 60 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
NEXT_JOB_ID             = 31937596
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = (null)
PowerParameters         = (null)
PowerPlugin             =
PreemptMode             = GANG,SUSPEND
PreemptType             = preempt/partition_prio
PreemptExemptTime       = 00:02:00
PrEpParameters          = (null)
PrEpPlugins             = prep/script
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityType            = priority/basic
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 2
RoutePlugin             = route/default
SbcastParameters        = (null)
SchedulerParameters     = 
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
ScronParameters         = (null)
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY
SlurmUser               = slurm(471)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = clctl1
SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmctldPort           = 6816-6817
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 120 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurm/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 20.11.9
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /data/slurm/spool
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/affinity,task/cgroup
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 255
UsePam                  = No
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /cgroup
ConstrainCores          = yes
ConstrainDevices        = no
ConstrainKmemSpace      = no
ConstrainRAMSpace       = yes
ConstrainSwapSpace      = yes
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no

Slurmctld(primary) at clctl1 is UP

Please let me know if any other information is needed to understand 
this. Any help is appreciated.

