[slurm-users] Job killed for unknown reason
Doug Meyer
dameyer99 at gmail.com
Wed Apr 5 02:19:39 UTC 2023
Hi,
I don't think I have ever seen a signal 9 that wasn't a user. Is it
possible you have folks with Slurm coordinator/administrator privileges
who may be killing jobs or running a cleanup script? The only other thing
I can think of is the user closing their remote session before the srun
completes. I can't recall offhand how it shows up in the logs, but the
OOM killer might also be at work; run dmesg -T | grep oom on the node to
see if the OS is killing jobs to recover memory.
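If the OOM killer is involved you can usually confirm it on the compute
node with something along these lines (a sketch only: cgroup v1 paths
built from the CgroupMountpoint=/cgroup in your config, and the job's
cgroup directory only exists while the job is still running):

  # kernel OOM-killer activity with readable timestamps
  dmesg -T | grep -i -E 'oom|killed process'
  # the systemd journal keeps more history than the dmesg ring buffer
  journalctl -k | grep -i oom
  # nonzero failcnt means the job hit its cgroup memory limit
  cat /cgroup/memory/slurm/uid_255/job_31360187/memory.failcnt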
Doug
On Mon, Apr 3, 2023, 8:56 AM Robert Barton <rob at realintent.com> wrote:
> Hello,
>
> I'm looking for help understanding a problem where Slurm indicates that a
> job was killed, but not why. We've seen jobs killed for exceeding time
> limits and for out-of-memory conditions, and in those cases the reason is
> obvious in the logs; nothing like that appears here, so it's not clear
> what is actually killing these jobs.
>
> Googling the error messages suggests that the jobs are being killed from
> outside of Slurm, but the engineer insists that this is not the case.
>
> This happens sporadically, perhaps once every one or two million jobs, and
> is not reliably reproducible. I'm looking for any way to gather more
> information about the cause of these failures.
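>
> One avenue we haven't exhausted yet (a sketch based on the slurm.conf and
> scontrol man pages; I haven't verified it surfaces anything for this
> particular failure) is raising the daemon log levels until it recurs:
>
> # slurmctld verbosity and step-lifecycle detail, no restart needed
> scontrol setdebug debug2
> scontrol setdebugflags +Steps
> # for slurmd, set SlurmdDebug=debug2 in slurm.conf, then push it out:
> scontrol reconfigure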
>
> Slurm version: 20.11.9
>
> The relevant messages:
>
> slurmctld.log:
>
> [2023-03-27T20:53:55.336] sched: _slurm_rpc_allocate_resources
> JobId=31360187 NodeList=(null) usec=5871
> [2023-03-27T20:54:16.753] sched: Allocate JobId=31360187 NodeList=cl4
> #CPUs=1 Partition=build
> [2023-03-27T20:54:27.104] _job_complete: JobId=31360187 WTERMSIG 9
> [2023-03-27T20:54:27.104] _job_complete: JobId=31360187 done
>
> slurmd.log:
>
> [2023-03-27T20:54:23.978] launch task StepId=31360187.0 request from
> UID:255 GID:100 HOST:10.52.49.107 PORT:59370
> [2023-03-27T20:54:23.979] task/affinity: lllp_distribution: JobId=31360187
> implicit auto binding: cores,one_thread, dist 1
> [2023-03-27T20:54:23.979] task/affinity: _lllp_generate_cpu_bind:
> _lllp_generate_cpu_bind jobid [31360187]: mask_cpu,one_thread, 0x000008
> [2023-03-27T20:54:24.236] [31360187.0] task/cgroup: _memcg_initialize:
> /slurm/uid_255/job_31360187: alloc=4096MB mem.limit=4096MB
> memsw.limit=4096MB
> [2023-03-27T20:54:24.236] [31360187.0] task/cgroup: _memcg_initialize:
> /slurm/uid_255/job_31360187/step_0: alloc=4096MB mem.limit=4096MB
> memsw.limit=4096MB
> [2023-03-27T20:54:27.038] [31360187.0] error: *** STEP 31360187.0 ON cl4
> CANCELLED AT 2023-03-27T20:54:27 ***
> [2023-03-27T20:54:27.099] [31360187.0] done with job
>
> srun output:
>
> srun: job 31360187 queued and waiting for resources
> srun: job 31360187 has been allocated resources
> srun: jobid 31360187: nodes(1):`cl4', cpu counts: 1(x1)
> srun: launching StepId=31360187.0 on host cl4, 1 tasks: 0
> srun: launch/slurm: launch_p_step_launch: StepId=31360187.0 aborted before
> step completely launched.
> srun: Complete StepId=31360187.0+0 received
> slurmstepd: error: *** STEP 31360187.0 ON cl4 CANCELLED AT
> 2023-03-27T20:54:27 ***
> srun: launch/slurm: _task_finish: Received task exit notification for 1
> task of StepId=31360187.0 (status=0x0009).
>
> accounting:
>
> # sacct -o jobid,elapsed,reason,state,exit -j 31360187
> JobID Elapsed Reason State ExitCode
> ------------ ---------- ---------------------- ---------- --------
> 31360187 00:00:11 None FAILED 0:9
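>
> Pulling a couple of extra accounting fields might also narrow it down; a
> tentative example (field names per the sacct man page, and
> AccountingStoreJobComment is enabled here, so Comment should be populated
> if anything set one):
>
> # sacct -j 31360187 -o jobid,state,exitcode,derivedexitcode,comment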
>
>
> These are compile jobs run via srun. The srun command is of the following
> form (I've omitted the -I and -D parts, which are irrelevant here and
> contain private information):
>
> ( echo -n 'max=3126 ; printf "[%2d%% %${#max}d/3126] %s\n" `expr 2090 \*
> 100 / 3126` 2090 "["c+11.2"] $(printf "[slurm %4s %s]" $(uname -n)
> $SLURM_JOB_ID) objectfile.o" ; fs_sync.sh sourcefile.cpp Makefile.flags ; '
> ; printf '%q ' g++ -MT objectfile.o -MMD -MP -MF optionfile.Td -m64 -Werror
> -W -Wall -Wno-parentheses -Wno-unused-parameter -Wno-uninitialized
> -Wno-maybe-uninitialized -Wno-misleading-indentation
> -Wno-implicit-fallthrough -std=c++20 -g -g2 ) | srun -J rgrmake -p build
> -N 1 -n 1 -c 1 --quit-on-interrupt --mem=4gb --verbose bash && fs_sync.sh
> objectfile.o
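>
> Stripped of the progress-message scaffolding, each job boils down to
> something like this (a simplified sketch, not the literal command; the
> real one just prepends the status printf shown above):
>
> ( echo 'fs_sync.sh sourcefile.cpp Makefile.flags ; ' ; \
>   printf '%q ' g++ -std=c++20 -c sourcefile.cpp -o objectfile.o ) \
>   | srun -J rgrmake -p build -N 1 -n 1 -c 1 --quit-on-interrupt \
>     --mem=4gb --verbose bash \
>   && fs_sync.sh objectfile.o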
>
>
> Slurm config:
>
> Configuration data as of 2023-03-31T16:01:44
> AccountingStorageBackupHost = (null)
> AccountingStorageEnforce = none
> AccountingStorageHost = podarkes
> AccountingStorageExternalHost = (null)
> AccountingStorageParameters = (null)
> AccountingStoragePort = 6819
> AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages
> AccountingStorageType = accounting_storage/slurmdbd
> AccountingStorageUser = N/A
> AccountingStoreJobComment = Yes
> AcctGatherEnergyType = acct_gather_energy/none
> AcctGatherFilesystemType = acct_gather_filesystem/none
> AcctGatherInterconnectType = acct_gather_interconnect/none
> AcctGatherNodeFreq = 0 sec
> AcctGatherProfileType = acct_gather_profile/none
> AllowSpecResourcesUsage = No
> AuthAltTypes = (null)
> AuthAltParameters = (null)
> AuthInfo = (null)
> AuthType = auth/munge
> BatchStartTimeout = 10 sec
> BOOT_TIME = 2023-02-21T10:02:56
> BurstBufferType = (null)
> CliFilterPlugins = (null)
> ClusterName = ri_cluster_v20
> CommunicationParameters = (null)
> CompleteWait = 0 sec
> CoreSpecPlugin = core_spec/none
> CpuFreqDef = Unknown
> CpuFreqGovernors = Performance,OnDemand,UserSpace
> CredType = cred/munge
> DebugFlags = NO_CONF_HASH
> DefMemPerNode = UNLIMITED
> DependencyParameters = (null)
> DisableRootJobs = No
> EioTimeout = 60
> EnforcePartLimits = NO
> Epilog = (null)
> EpilogMsgTime = 2000 usec
> EpilogSlurmctld = (null)
> ExtSensorsType = ext_sensors/none
> ExtSensorsFreq = 0 sec
> FederationParameters = (null)
> FirstJobId = 1
> GetEnvTimeout = 2 sec
> GresTypes = (null)
> GpuFreqDef = high,memory=high
> GroupUpdateForce = 1
> GroupUpdateTime = 600 sec
> HASH_VAL = Different Ours=0xf7a11381 Slurmctld=0x98e3b483
> HealthCheckInterval = 0 sec
> HealthCheckNodeState = ANY
> HealthCheckProgram = (null)
> InactiveLimit = 0 sec
> InteractiveStepOptions = --interactive --preserve-env --pty $SHELL
> JobAcctGatherFrequency = 30
> JobAcctGatherType = jobacct_gather/linux
> JobAcctGatherParams = (null)
> JobCompHost = localhost
> JobCompLoc = /var/log/slurm_jobcomp.log
> JobCompPort = 0
> JobCompType = jobcomp/none
> JobCompUser = root
> JobContainerType = job_container/none
> JobCredentialPrivateKey = (null)
> JobCredentialPublicCertificate = (null)
> JobDefaults = (null)
> JobFileAppend = 0
> JobRequeue = 1
> JobSubmitPlugins = (null)
> KeepAliveTime = SYSTEM_DEFAULT
> KillOnBadExit = 0
> KillWait = 30 sec
> LaunchParameters = (null)
> LaunchType = launch/slurm
> Licenses = (null)
> LogTimeFormat = iso8601_ms
> MailDomain = (null)
> MailProg = /bin/mail
> MaxArraySize = 1001
> MaxDBDMsgs = 20112
> MaxJobCount = 10000
> MaxJobId = 67043328
> MaxMemPerNode = UNLIMITED
> MaxStepCount = 40000
> MaxTasksPerNode = 512
> MCSPlugin = mcs/none
> MCSParameters = (null)
> MessageTimeout = 60 sec
> MinJobAge = 300 sec
> MpiDefault = none
> MpiParams = (null)
> NEXT_JOB_ID = 31937596
> NodeFeaturesPlugins = (null)
> OverTimeLimit = 0 min
> PluginDir = /usr/lib64/slurm
> PlugStackConfig = (null)
> PowerParameters = (null)
> PowerPlugin =
> PreemptMode = GANG,SUSPEND
> PreemptType = preempt/partition_prio
> PreemptExemptTime = 00:02:00
> PrEpParameters = (null)
> PrEpPlugins = prep/script
> PriorityParameters = (null)
> PrioritySiteFactorParameters = (null)
> PrioritySiteFactorPlugin = (null)
> PriorityType = priority/basic
> PrivateData = none
> ProctrackType = proctrack/cgroup
> Prolog = (null)
> PrologEpilogTimeout = 65534
> PrologSlurmctld = (null)
> PrologFlags = (null)
> PropagatePrioProcess = 0
> PropagateResourceLimits = ALL
> PropagateResourceLimitsExcept = (null)
> RebootProgram = (null)
> ReconfigFlags = (null)
> RequeueExit = (null)
> RequeueExitHold = (null)
> ResumeFailProgram = (null)
> ResumeProgram = (null)
> ResumeRate = 300 nodes/min
> ResumeTimeout = 60 sec
> ResvEpilog = (null)
> ResvOverRun = 0 min
> ResvProlog = (null)
> ReturnToService = 2
> RoutePlugin = route/default
> SbcastParameters = (null)
> SchedulerParameters =
> batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_resolution=600,bf_yield_interval=1000000,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=2000000
> SchedulerTimeSlice = 30 sec
> SchedulerType = sched/backfill
> ScronParameters = (null)
> SelectType = select/cons_res
> SelectTypeParameters = CR_CORE_MEMORY
> SlurmUser = slurm(471)
> SlurmctldAddr = (null)
> SlurmctldDebug = info
> SlurmctldHost[0] = clctl1
> SlurmctldLogFile = /var/log/slurm/slurmctld.log
> SlurmctldPort = 6816-6817
> SlurmctldSyslogDebug = unknown
> SlurmctldPrimaryOffProg = (null)
> SlurmctldPrimaryOnProg = (null)
> SlurmctldTimeout = 120 sec
> SlurmctldParameters = (null)
> SlurmdDebug = info
> SlurmdLogFile = /var/log/slurm/slurmd.log
> SlurmdParameters = (null)
> SlurmdPidFile = /var/run/slurmd.pid
> SlurmdPort = 6818
> SlurmdSpoolDir = /var/spool/slurmd
> SlurmdSyslogDebug = unknown
> SlurmdTimeout = 300 sec
> SlurmdUser = root(0)
> SlurmSchedLogFile = (null)
> SlurmSchedLogLevel = 0
> SlurmctldPidFile = /var/run/slurmctld.pid
> SlurmctldPlugstack = (null)
> SLURM_CONF = /etc/slurm/slurm.conf
> SLURM_VERSION = 20.11.9
> SrunEpilog = (null)
> SrunPortRange = 0-0
> SrunProlog = (null)
> StateSaveLocation = /data/slurm/spool
> SuspendExcNodes = (null)
> SuspendExcParts = (null)
> SuspendProgram = (null)
> SuspendRate = 60 nodes/min
> SuspendTime = NONE
> SuspendTimeout = 30 sec
> SwitchType = switch/none
> TaskEpilog = (null)
> TaskPlugin = task/affinity,task/cgroup
> TaskPluginParam = (null type)
> TaskProlog = (null)
> TCPTimeout = 2 sec
> TmpFS = /tmp
> TopologyParam = (null)
> TopologyPlugin = topology/none
> TrackWCKey = No
> TreeWidth = 255
> UsePam = No
> UnkillableStepProgram = (null)
> UnkillableStepTimeout = 60 sec
> VSizeFactor = 0 percent
> WaitTime = 0 sec
> X11Parameters = (null)
>
> Cgroup Support Configuration:
> AllowedDevicesFile = /etc/slurm/cgroup_allowed_devices_file.conf
> AllowedKmemSpace = (null)
> AllowedRAMSpace = 100.0%
> AllowedSwapSpace = 0.0%
> CgroupAutomount = yes
> CgroupMountpoint = /cgroup
> ConstrainCores = yes
> ConstrainDevices = no
> ConstrainKmemSpace = no
> ConstrainRAMSpace = yes
> ConstrainSwapSpace = yes
> MaxKmemPercent = 100.0%
> MaxRAMPercent = 100.0%
> MaxSwapPercent = 100.0%
> MemorySwappiness = (null)
> MinKmemSpace = 30 MB
> MinRAMSpace = 30 MB
> TaskAffinity = no
>
> Slurmctld(primary) at clctl1 is UP
>
>
> Please let me know if any other information is needed to understand this.
> Any help is appreciated.
>
> Thanks,
> -rob
>
>