[slurm-users] Job killed for unknown reason
Doug Meyer
dameyer99 at gmail.com
Wed Apr 5 02:19:39 UTC 2023
Hi,
I don't think I have ever seen a signal 9 that wasn't a user. Is it
possible you have folks with Slurm coordinator/administrator privileges
who may be killing jobs or running a cleanup script? The only other thing
I can think of is the user closing their remote session before the srun
completes. I can't recall offhand how it shows up in the logs, but the
OOM killer might also be at work; run dmesg -T | grep oom on the node to
see if the OS is killing jobs to recover memory.
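If the OOM killer is involved you can usually confirm it on the compute
node with something along these lines (a sketch only: cgroup v1 paths
built from the CgroupMountpoint=/cgroup in your config, and the job's
cgroup directory only exists while the job is still running):

  # kernel OOM-killer activity with readable timestamps
  dmesg -T | grep -i -E 'oom|killed process'
  # the systemd journal keeps more history than the dmesg ring buffer
  journalctl -k | grep -i oom
  # nonzero failcnt means the job hit its cgroup memory limit
  cat /cgroup/memory/slurm/uid_255/job_31360187/memory.failcnt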
Doug
On Mon, Apr 3, 2023, 8:56 AM Robert Barton <rob at realintent.com> wrote:
> Hello,
>
> I'm looking for help understanding a problem where Slurm indicates that a
> job was killed, but not why. We've seen jobs killed for exceeding time
> limits and for out-of-memory conditions, and in those cases the reason is
> obvious in the logs; nothing like that appears here, so it's not clear
> what is actually killing these jobs.
>
> Googling the error messages suggests that the jobs are being killed from
> outside of Slurm, but the engineer insists that this is not the case.
>
> This happens sporadically, perhaps once every one or two million jobs, and
> is not reliably reproducible. I'm looking for any way to gather more
> information about the cause of these failures.
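>
> One avenue we haven't exhausted yet (a sketch based on the slurm.conf and
> scontrol man pages; I haven't verified it surfaces anything for this
> particular failure) is raising the daemon log levels until it recurs:
>
> # slurmctld verbosity and step-lifecycle detail, no restart needed
> scontrol setdebug debug2
> scontrol setdebugflags +Steps
> # for slurmd, set SlurmdDebug=debug2 in slurm.conf, then push it out:
> scontrol reconfigure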
>
> Slurm version: 20.11.9
>
> The relevant messages:
>
> slurmctld.log:
>
> [2023-03-27T20:53:55.336] sched: _slurm_rpc_allocate_resources
> JobId=31360187 NodeList=(null) usec=5871
> [2023-03-27T20:54:16.753] sched: Allocate JobId=31360187 NodeList=cl4
> #CPUs=1 Partition=build
> [2023-03-27T20:54:27.104] _job_complete: JobId=31360187 WTERMSIG 9
> [2023-03-27T20:54:27.104] _job_complete: JobId=31360187 done
>
> slurmd.log:
>
> [2023-03-27T20:54:23.978] launch task StepId=31360187.0 request from
> UID:255 GID:100 HOST:10.52.49.107 PORT:59370
> [2023-03-27T20:54:23.979] task/affinity: lllp_distribution: JobId=31360187
> implicit auto binding: cores,one_thread, dist 1
> [2023-03-27T20:54:23.979] task/affinity: _lllp_generate_cpu_bind:
> _lllp_generate_cpu_bind jobid [31360187]: mask_cpu,one_thread, 0x000008
> [2023-03-27T20:54:24.236] [31360187.0] task/cgroup: _memcg_initialize:
> /slurm/uid_255/job_31360187: alloc=4096MB mem.limit=4096MB
> memsw.limit=4096MB
> [2023-03-27T20:54:24.236] [31360187.0] task/cgroup: _memcg_initialize:
> /slurm/uid_255/job_31360187/step_0: alloc=4096MB mem.limit=4096MB
> memsw.limit=4096MB
> [2023-03-27T20:54:27.038] [31360187.0] error: *** STEP 31360187.0 ON cl4
> CANCELLED AT 2023-03-27T20:54:27 ***
> [2023-03-27T20:54:27.099] [31360187.0] done with job
>
> srun output:
>
> srun: job 31360187 queued and waiting for resources
> srun: job 31360187 has been allocated resources
> srun: jobid 31360187: nodes(1):`cl4', cpu counts: 1(x1)
> srun: launching StepId=31360187.0 on host cl4, 1 tasks: 0
> srun: launch/slurm: launch_p_step_launch: StepId=31360187.0 aborted before
> step completely launched.
> srun: Complete StepId=31360187.0+0 received
> slurmstepd: error: *** STEP 31360187.0 ON cl4 CANCELLED AT
> 2023-03-27T20:54:27 ***
> srun: launch/slurm: _task_finish: Received task exit notification for 1
> task of StepId=31360187.0 (status=0x0009).
>
> accounting:
>
> # sacct -o jobid,elapsed,reason,state,exit -j 31360187
> JobID Elapsed Reason State ExitCode
> ------------ ---------- ---------------------- ---------- --------
> 31360187 00:00:11 None FAILED 0:9
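>
> Pulling a couple of extra accounting fields might also narrow it down; a
> tentative example (field names per the sacct man page, and
> AccountingStoreJobComment is enabled here, so Comment should be populated
> if anything set one):
>
> # sacct -j 31360187 -o jobid,state,exitcode,derivedexitcode,comment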
>
>
> These are compile jobs run via srun. The srun command is of the following
> form (I've omitted the -I and -D parts, which are irrelevant here and
> contain private information):
>
> ( echo -n 'max=3126 ; printf "[%2d%% %${#max}d/3126] %s\n" `expr 2090 \*
> 100 / 3126` 2090 "["c+11.2"] $(printf "[slurm %4s %s]" $(uname -n)
> $SLURM_JOB_ID) objectfile.o" ; fs_sync.sh sourcefile.cpp Makefile.flags ; '
> ; printf '%q ' g++ -MT objectfile.o -MMD -MP -MF optionfile.Td -m64 -Werror
> -W -Wall -Wno-parentheses -Wno-unused-parameter -Wno-uninitialized
> -Wno-maybe-uninitialized -Wno-misleading-indentation
> -Wno-implicit-fallthrough -std=c++20 -g -g2 ) | srun -J rgrmake -p build
> -N 1 -n 1 -c 1 --quit-on-interrupt --mem=4gb --verbose bash && fs_sync.sh
> objectfile.o
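>
> Stripped of the progress-message scaffolding, each job boils down to
> something like this (a simplified sketch, not the literal command; the
> real one just prepends the status printf shown above):
>
> ( echo 'fs_sync.sh sourcefile.cpp Makefile.flags ; ' ; \
>   printf '%q ' g++ -std=c++20 -c sourcefile.cpp -o objectfile.o ) \
>   | srun -J rgrmake -p build -N 1 -n 1 -c 1 --quit-on-interrupt \
>     --mem=4gb --verbose bash \
>   && fs_sync.sh objectfile.o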
>
>
> Slurm config:
>
> Configuration data as of 2023-03-31T16:01:44
> AccountingStorageBackupHost = (null)
> AccountingStorageEnforce = none
> AccountingStorageHost = podarkes
> AccountingStorageExternalHost = (null)
> AccountingStorageParameters = (null)
> AccountingStoragePort = 6819
> AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages
> AccountingStorageType = accounting_storage/slurmdbd
> AccountingStorageUser = N/A
> AccountingStoreJobComment = Yes
> AcctGatherEnergyType = acct_gather_energy/none
> AcctGatherFilesystemType = acct_gather_filesystem/none
> AcctGatherInterconnectType = acct_gather_interconnect/none
> AcctGatherNodeFreq = 0 sec
> AcctGatherProfileType = acct_gather_profile/none
> AllowSpecResourcesUsage = No
> AuthAltTypes = (null)
> AuthAltParameters = (null)
> AuthInfo = (null)
> AuthType = auth/munge
> BatchStartTimeout = 10 sec
> BOOT_TIME = 2023-02-21T10:02:56
> BurstBufferType = (null)
> CliFilterPlugins = (null)
> ClusterName = ri_cluster_v20
> CommunicationParameters = (null)
> CompleteWait = 0 sec
> CoreSpecPlugin = core_spec/none
> CpuFreqDef = Unknown
> CpuFreqGovernors = Performance,OnDemand,UserSpace
> CredType = cred/munge
> DebugFlags = NO_CONF_HASH
> DefMemPerNode = UNLIMITED
> DependencyParameters = (null)
> DisableRootJobs = No
> EioTimeout = 60
> EnforcePartLimits = NO
> Epilog = (null)
> EpilogMsgTime = 2000 usec
> EpilogSlurmctld = (null)
> ExtSensorsType = ext_sensors/none
> ExtSensorsFreq = 0 sec
> FederationParameters = (null)
> FirstJobId = 1
> GetEnvTimeout = 2 sec
> GresTypes = (null)
> GpuFreqDef = high,memory=high
> GroupUpdateForce = 1
> GroupUpdateTime = 600 sec
> HASH_VAL = Different Ours=0xf7a11381 Slurmctld=0x98e3b483
> HealthCheckInterval = 0 sec
> HealthCheckNodeState = ANY
> HealthCheckProgram = (null)
> InactiveLimit = 0 sec
> InteractiveStepOptions = --interactive --preserve-env --pty $SHELL
> JobAcctGatherFrequency = 30
> JobAcctGatherType = jobacct_gather/linux
> JobAcctGatherParams = (null)
> JobCompHost = localhost
> JobCompLoc = /var/log/slurm_jobcomp.log
> JobCompPort = 0
> JobCompType = jobcomp/none
> JobCompUser = root
> JobContainerType = job_container/none
> JobCredentialPrivateKey = (null)
> JobCredentialPublicCertificate = (null)
> JobDefaults = (null)
> JobFileAppend = 0
> JobRequeue = 1
> JobSubmitPlugins = (null)
> KeepAliveTime = SYSTEM_DEFAULT
> KillOnBadExit = 0
> KillWait = 30 sec
> LaunchParameters = (null)
> LaunchType = launch/slurm
> Licenses = (null)
> LogTimeFormat = iso8601_ms
> MailDomain = (null)
> MailProg = /bin/mail
> MaxArraySize = 1001
> MaxDBDMsgs = 20112
> MaxJobCount = 10000
> MaxJobId = 67043328
> MaxMemPerNode = UNLIMITED
> MaxStepCount = 40000
> MaxTasksPerNode = 512
> MCSPlugin = mcs/none
> MCSParameters = (null)
> MessageTimeout = 60 sec
> MinJobAge = 300 sec
> MpiDefault = none
> MpiParams = (null)
> NEXT_JOB_ID = 31937596
> NodeFeaturesPlugins = (null)
> OverTimeLimit = 0 min
> PluginDir = /usr/lib64/slurm
> PlugStackConfig = (null)
> PowerParameters = (null)
> PowerPlugin =
> PreemptMode = GANG,SUSPEND
> PreemptType = preempt/partition_prio
> PreemptExemptTime = 00:02:00
> PrEpParameters = (null)
> PrEpPlugins = prep/script
> PriorityParameters = (null)
> PrioritySiteFactorParameters = (null)
> PrioritySiteFactorPlugin = (null)
> PriorityType = priority/basic
> PrivateData = none
> ProctrackType = proctrack/cgroup
> Prolog = (null)
> PrologEpilogTimeout = 65534
> PrologSlurmctld = (null)
> PrologFlags = (null)
> PropagatePrioProcess = 0
> PropagateResourceLimits = ALL
> PropagateResourceLimitsExcept = (null)
> RebootProgram = (null)
> ReconfigFlags = (null)
> RequeueExit = (null)
> RequeueExitHold = (null)
> ResumeFailProgram = (null)
> ResumeProgram = (null)
> ResumeRate = 300 nodes/min
> ResumeTimeout = 60 sec
> ResvEpilog = (null)
> ResvOverRun = 0 min
> ResvProlog = (null)
> ReturnToService = 2
> RoutePlugin = route/default
> SbcastParameters = (null)
> SchedulerParameters =
> batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_resolution=600,bf_yield_interval=1000000,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=2000000
> SchedulerTimeSlice = 30 sec
> SchedulerType = sched/backfill
> ScronParameters = (null)
> SelectType = select/cons_res
> SelectTypeParameters = CR_CORE_MEMORY
> SlurmUser = slurm(471)
> SlurmctldAddr = (null)
> SlurmctldDebug = info
> SlurmctldHost[0] = clctl1
> SlurmctldLogFile = /var/log/slurm/slurmctld.log
> SlurmctldPort = 6816-6817
> SlurmctldSyslogDebug = unknown
> SlurmctldPrimaryOffProg = (null)
> SlurmctldPrimaryOnProg = (null)
> SlurmctldTimeout = 120 sec
> SlurmctldParameters = (null)
> SlurmdDebug = info
> SlurmdLogFile = /var/log/slurm/slurmd.log
> SlurmdParameters = (null)
> SlurmdPidFile = /var/run/slurmd.pid
> SlurmdPort = 6818
> SlurmdSpoolDir = /var/spool/slurmd
> SlurmdSyslogDebug = unknown
> SlurmdTimeout = 300 sec
> SlurmdUser = root(0)
> SlurmSchedLogFile = (null)
> SlurmSchedLogLevel = 0
> SlurmctldPidFile = /var/run/slurmctld.pid
> SlurmctldPlugstack = (null)
> SLURM_CONF = /etc/slurm/slurm.conf
> SLURM_VERSION = 20.11.9
> SrunEpilog = (null)
> SrunPortRange = 0-0
> SrunProlog = (null)
> StateSaveLocation = /data/slurm/spool
> SuspendExcNodes = (null)
> SuspendExcParts = (null)
> SuspendProgram = (null)
> SuspendRate = 60 nodes/min
> SuspendTime = NONE
> SuspendTimeout = 30 sec
> SwitchType = switch/none
> TaskEpilog = (null)
> TaskPlugin = task/affinity,task/cgroup
> TaskPluginParam = (null type)
> TaskProlog = (null)
> TCPTimeout = 2 sec
> TmpFS = /tmp
> TopologyParam = (null)
> TopologyPlugin = topology/none
> TrackWCKey = No
> TreeWidth = 255
> UsePam = No
> UnkillableStepProgram = (null)
> UnkillableStepTimeout = 60 sec
> VSizeFactor = 0 percent
> WaitTime = 0 sec
> X11Parameters = (null)
>
> Cgroup Support Configuration:
> AllowedDevicesFile = /etc/slurm/cgroup_allowed_devices_file.conf
> AllowedKmemSpace = (null)
> AllowedRAMSpace = 100.0%
> AllowedSwapSpace = 0.0%
> CgroupAutomount = yes
> CgroupMountpoint = /cgroup
> ConstrainCores = yes
> ConstrainDevices = no
> ConstrainKmemSpace = no
> ConstrainRAMSpace = yes
> ConstrainSwapSpace = yes
> MaxKmemPercent = 100.0%
> MaxRAMPercent = 100.0%
> MaxSwapPercent = 100.0%
> MemorySwappiness = (null)
> MinKmemSpace = 30 MB
> MinRAMSpace = 30 MB
> TaskAffinity = no
>
> Slurmctld(primary) at clctl1 is UP
>
>
> Please let me know if any other information is needed to understand this.
> Any help is appreciated.
>
> Thanks,
> -rob
>
>