[slurm-users] slurm-users Digest, Vol 66, Issue 6

Robert Barton rob at realintent.com
Wed Apr 5 15:55:53 UTC 2023


Hi Doug,

Thanks very much for the reply. I'm certain it's not an OOM kill: we've
looked for those in all the relevant /var/log/messages files, and we do
get our share of OOMs in this environment - we've configured Slurm to
kill jobs that exceed their defined memory limits, so we're familiar
with what that looks like in the logs.
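
For reference, this is roughly what we check when we suspect an OOM
(a sketch of our usual procedure; log paths may differ on other systems):

# kernel OOM-killer activity on the compute node
dmesg -T | grep -i oom
grep -iE 'out of memory|oom-kill' /var/log/messages
# a memory-limit kill normally shows up in accounting as OUT_OF_MEMORY
sacct -j 31360187 -o jobid,state,exitcode,maxrss

None of that shows anything for this job, and as the sacct output in my
original message shows, the job ended FAILED 0:9 rather than
OUT_OF_MEMORY.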

The engineer asserts not only that the process wasn't killed by him or
by the calling process, but also that Slurm didn't run the job at all.
I believe he thinks that because he didn't see the output he was
looking for; however, as the logs show, the job started and ran for a
few seconds. My guess is that the job didn't live long enough to flush
its stdout buffer.
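
If it would help to test that theory, one option (just a sketch; we
haven't tried it yet) is to force unbuffered output on the next run,
for example:

# srun -u/--unbuffered runs the task on a pseudo-terminal so stdio isn't buffered
srun --unbuffered -J rgrmake -p build -N 1 -n 1 -c 1 --mem=4gb bash
# or force line buffering inside the step (<command> is a placeholder):
stdbuf -oL -eL <command>

If the missing output then appears right up to the point of the kill,
buffering explains what the engineer saw.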

The job in question is launched by a process that is itself run from
cron, so I believe that rules out the possibility of a remote session
closing.

My best information tells me that the job started, ran for a few
seconds until it realized it was missing something it needed, and died,
but I don't have enough insight to be sure. The srun message below is
the most perplexing part, as I don't believe I've seen it before, and
Googling turns up very little useful information:

srun: launch/slurm: launch_p_step_launch: StepId=31360187.0 aborted before step completely launched.
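
In case it helps narrow this down the next time it reproduces, my plan
(a sketch only; we haven't done this yet) is to temporarily raise the
debug levels and add srun verbosity around these build jobs:

scontrol setdebug debug2        # raise slurmctld logging temporarily
scontrol setdebugflags +Steps   # extra detail on step launch and teardown
# on the build nodes: set SlurmdDebug=debug2 in slurm.conf, then scontrol reconfigure
# and add -vv to the wrapper's srun to see which side reports the kill first

That should at least tell us whether the SIGKILL originates inside
Slurm or from something external.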


Have you ever seen this before?

Thanks,
-rob


On 4/4/23 21:17, slurm-users-request at lists.schedmd.com wrote:
> Today's Topics:
>
>     1. Re: Job killed for unknown reason (Doug Meyer)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 4 Apr 2023 19:19:39 -0700
> From: Doug Meyer <dameyer99 at gmail.com>
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] Job killed for unknown reason
>
> Hi,
>
> I don't think I have ever seen a sig 9 that wasn't a user. Is it possible
> you have folks with the Slurm coordinator/administrator role who may be
> killing jobs or running a cleanup script? The only other thing I can think
> of is the user closing their remote session before the srun completes. I
> can't recall for certain, but the OOM killer may also be at work - run
> dmesg -T | grep oom to see if the OS is killing jobs to reclaim memory.
>
> Doug
>
>
> On Mon, Apr 3, 2023, 8:56 AM Robert Barton <rob at realintent.com> wrote:
>
>> Hello,
>>
>> I'm looking for help understanding a problem in which Slurm indicates
>> that a job was killed, but not why. It's not clear what is actually
>> killing the jobs; we've seen jobs killed for time limits and
>> out-of-memory issues, and those reasons are obvious in the logs when
>> they happen, but that's not what's happening here.
>>
>> Googling the error messages suggests the jobs are being killed outside
>> of Slurm, but the engineer insists that this is not the case.
>>
>> This happens sporadically, maybe once every one or two million jobs, and
>> is not reliably reproducible. I'm looking for any way to gather more
>> information about the cause of these failures.
>>
>> Slurm version: 20.11.9
>>
>> The relevant messages:
>>
>> slurmctld.log:
>>
>> [2023-03-27T20:53:55.336] sched: _slurm_rpc_allocate_resources
>> JobId=31360187 NodeList=(null) usec=5871
>> [2023-03-27T20:54:16.753] sched: Allocate JobId=31360187 NodeList=cl4
>> #CPUs=1 Partition=build
>> [2023-03-27T20:54:27.104] _job_complete: JobId=31360187 WTERMSIG 9
>> [2023-03-27T20:54:27.104] _job_complete: JobId=31360187 done
>>
>> slurmd.log:
>>
>> [2023-03-27T20:54:23.978] launch task StepId=31360187.0 request from
>> UID:255 GID:100 HOST:10.52.49.107 PORT:59370
>> [2023-03-27T20:54:23.979] task/affinity: lllp_distribution: JobId=31360187
>> implicit auto binding: cores,one_thread, dist 1
>> [2023-03-27T20:54:23.979] task/affinity: _lllp_generate_cpu_bind:
>> _lllp_generate_cpu_bind jobid [31360187]: mask_cpu,one_thread, 0x000008
>> [2023-03-27T20:54:24.236] [31360187.0] task/cgroup: _memcg_initialize:
>> /slurm/uid_255/job_31360187: alloc=4096MB mem.limit=4096MB
>> memsw.limit=4096MB
>> [2023-03-27T20:54:24.236] [31360187.0] task/cgroup: _memcg_initialize:
>> /slurm/uid_255/job_31360187/step_0: alloc=4096MB mem.limit=4096MB
>> memsw.limit=4096MB
>> [2023-03-27T20:54:27.038] [31360187.0] error: *** STEP 31360187.0 ON cl4
>> CANCELLED AT 2023-03-27T20:54:27 ***
>> [2023-03-27T20:54:27.099] [31360187.0] done with job
>>
>> srun output:
>>
>> srun: job 31360187 queued and waiting for resources
>> srun: job 31360187 has been allocated resources
>> srun: jobid 31360187: nodes(1):`cl4', cpu counts: 1(x1)
>> srun: launching StepId=31360187.0 on host cl4, 1 tasks: 0
>> srun: launch/slurm: launch_p_step_launch: StepId=31360187.0 aborted before
>> step completely launched.
>> srun: Complete StepId=31360187.0+0 received
>> slurmstepd: error: *** STEP 31360187.0 ON cl4 CANCELLED AT
>> 2023-03-27T20:54:27 ***
>> srun: launch/slurm: _task_finish: Received task exit notification for 1
>> task of StepId=31360187.0 (status=0x0009).
>>
>> accounting:
>>
>> # sacct -o jobid,elapsed,reason,state,exit -j 31360187
>>         JobID    Elapsed                 Reason      State ExitCode
>> ------------ ---------- ---------------------- ---------- --------
>> 31360187       00:00:11                   None     FAILED      0:9
>>
>>
>> These are compile jobs run via srun. The srun command is of this form
>> (I've omitted the -I and -D parts as irrelevant and containing private
>> information):
>>
>> ( echo -n 'max=3126 ; printf "[%2d%% %${#max}d/3126] %s\n" `expr 2090 \*
>> 100 / 3126` 2090 "["c+11.2"] $(printf "[slurm %4s %s]" $(uname -n)
>> $SLURM_JOB_ID) objectfile.o" ; fs_sync.sh sourcefile.cpp Makefile.flags ; '
>> ; printf '%q ' g++ -MT objectfile.o -MMD -MP -MF optionfile.Td -m64 -Werror
>> -W -Wall -Wno-parentheses -Wno-unused-parameter -Wno-uninitialized
>> -Wno-maybe-uninitialized  -Wno-misleading-indentation
>> -Wno-implicit-fallthrough -std=c++20 -g -g2 ) | srun  -J rgrmake -p build
>> -N 1 -n 1 -c 1 --quit-on-interrupt --mem=4gb --verbose bash  && fs_sync.sh
>> objectfile.o
>>
>>
>> Slurm config:
>>
>> Configuration data as of 2023-03-31T16:01:44
>> AccountingStorageBackupHost = (null)
>> AccountingStorageEnforce = none
>> AccountingStorageHost   = podarkes
>> AccountingStorageExternalHost = (null)
>> AccountingStorageParameters = (null)
>> AccountingStoragePort   = 6819
>> AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
>> AccountingStorageType   = accounting_storage/slurmdbd
>> AccountingStorageUser   = N/A
>> AccountingStoreJobComment = Yes
>> AcctGatherEnergyType    = acct_gather_energy/none
>> AcctGatherFilesystemType = acct_gather_filesystem/none
>> AcctGatherInterconnectType = acct_gather_interconnect/none
>> AcctGatherNodeFreq      = 0 sec
>> AcctGatherProfileType   = acct_gather_profile/none
>> AllowSpecResourcesUsage = No
>> AuthAltTypes            = (null)
>> AuthAltParameters       = (null)
>> AuthInfo                = (null)
>> AuthType                = auth/munge
>> BatchStartTimeout       = 10 sec
>> BOOT_TIME               = 2023-02-21T10:02:56
>> BurstBufferType         = (null)
>> CliFilterPlugins        = (null)
>> ClusterName             = ri_cluster_v20
>> CommunicationParameters = (null)
>> CompleteWait            = 0 sec
>> CoreSpecPlugin          = core_spec/none
>> CpuFreqDef              = Unknown
>> CpuFreqGovernors        = Performance,OnDemand,UserSpace
>> CredType                = cred/munge
>> DebugFlags              = NO_CONF_HASH
>> DefMemPerNode           = UNLIMITED
>> DependencyParameters    = (null)
>> DisableRootJobs         = No
>> EioTimeout              = 60
>> EnforcePartLimits       = NO
>> Epilog                  = (null)
>> EpilogMsgTime           = 2000 usec
>> EpilogSlurmctld         = (null)
>> ExtSensorsType          = ext_sensors/none
>> ExtSensorsFreq          = 0 sec
>> FederationParameters    = (null)
>> FirstJobId              = 1
>> GetEnvTimeout           = 2 sec
>> GresTypes               = (null)
>> GpuFreqDef              = high,memory=high
>> GroupUpdateForce        = 1
>> GroupUpdateTime         = 600 sec
>> HASH_VAL                = Different Ours=0xf7a11381 Slurmctld=0x98e3b483
>> HealthCheckInterval     = 0 sec
>> HealthCheckNodeState    = ANY
>> HealthCheckProgram      = (null)
>> InactiveLimit           = 0 sec
>> InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL
>> JobAcctGatherFrequency  = 30
>> JobAcctGatherType       = jobacct_gather/linux
>> JobAcctGatherParams     = (null)
>> JobCompHost             = localhost
>> JobCompLoc              = /var/log/slurm_jobcomp.log
>> JobCompPort             = 0
>> JobCompType             = jobcomp/none
>> JobCompUser             = root
>> JobContainerType        = job_container/none
>> JobCredentialPrivateKey = (null)
>> JobCredentialPublicCertificate = (null)
>> JobDefaults             = (null)
>> JobFileAppend           = 0
>> JobRequeue              = 1
>> JobSubmitPlugins        = (null)
>> KeepAliveTime           = SYSTEM_DEFAULT
>> KillOnBadExit           = 0
>> KillWait                = 30 sec
>> LaunchParameters        = (null)
>> LaunchType              = launch/slurm
>> Licenses                = (null)
>> LogTimeFormat           = iso8601_ms
>> MailDomain              = (null)
>> MailProg                = /bin/mail
>> MaxArraySize            = 1001
>> MaxDBDMsgs              = 20112
>> MaxJobCount             = 10000
>> MaxJobId                = 67043328
>> MaxMemPerNode           = UNLIMITED
>> MaxStepCount            = 40000
>> MaxTasksPerNode         = 512
>> MCSPlugin               = mcs/none
>> MCSParameters           = (null)
>> MessageTimeout          = 60 sec
>> MinJobAge               = 300 sec
>> MpiDefault              = none
>> MpiParams               = (null)
>> NEXT_JOB_ID             = 31937596
>> NodeFeaturesPlugins     = (null)
>> OverTimeLimit           = 0 min
>> PluginDir               = /usr/lib64/slurm
>> PlugStackConfig         = (null)
>> PowerParameters         = (null)
>> PowerPlugin             =
>> PreemptMode             = GANG,SUSPEND
>> PreemptType             = preempt/partition_prio
>> PreemptExemptTime       = 00:02:00
>> PrEpParameters          = (null)
>> PrEpPlugins             = prep/script
>> PriorityParameters      = (null)
>> PrioritySiteFactorParameters = (null)
>> PrioritySiteFactorPlugin = (null)
>> PriorityType            = priority/basic
>> PrivateData             = none
>> ProctrackType           = proctrack/cgroup
>> Prolog                  = (null)
>> PrologEpilogTimeout     = 65534
>> PrologSlurmctld         = (null)
>> PrologFlags             = (null)
>> PropagatePrioProcess    = 0
>> PropagateResourceLimits = ALL
>> PropagateResourceLimitsExcept = (null)
>> RebootProgram           = (null)
>> ReconfigFlags           = (null)
>> RequeueExit             = (null)
>> RequeueExitHold         = (null)
>> ResumeFailProgram       = (null)
>> ResumeProgram           = (null)
>> ResumeRate              = 300 nodes/min
>> ResumeTimeout           = 60 sec
>> ResvEpilog              = (null)
>> ResvOverRun             = 0 min
>> ResvProlog              = (null)
>> ReturnToService         = 2
>> RoutePlugin             = route/default
>> SbcastParameters        = (null)
>> SchedulerParameters     =
>> batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_resolution=600,bf_yield_interval=1000000,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=2000000
>> SchedulerTimeSlice      = 30 sec
>> SchedulerType           = sched/backfill
>> ScronParameters         = (null)
>> SelectType              = select/cons_res
>> SelectTypeParameters    = CR_CORE_MEMORY
>> SlurmUser               = slurm(471)
>> SlurmctldAddr           = (null)
>> SlurmctldDebug          = info
>> SlurmctldHost[0]        = clctl1
>> SlurmctldLogFile        = /var/log/slurm/slurmctld.log
>> SlurmctldPort           = 6816-6817
>> SlurmctldSyslogDebug    = unknown
>> SlurmctldPrimaryOffProg = (null)
>> SlurmctldPrimaryOnProg  = (null)
>> SlurmctldTimeout        = 120 sec
>> SlurmctldParameters     = (null)
>> SlurmdDebug             = info
>> SlurmdLogFile           = /var/log/slurm/slurmd.log
>> SlurmdParameters        = (null)
>> SlurmdPidFile           = /var/run/slurmd.pid
>> SlurmdPort              = 6818
>> SlurmdSpoolDir          = /var/spool/slurmd
>> SlurmdSyslogDebug       = unknown
>> SlurmdTimeout           = 300 sec
>> SlurmdUser              = root(0)
>> SlurmSchedLogFile       = (null)
>> SlurmSchedLogLevel      = 0
>> SlurmctldPidFile        = /var/run/slurmctld.pid
>> SlurmctldPlugstack      = (null)
>> SLURM_CONF              = /etc/slurm/slurm.conf
>> SLURM_VERSION           = 20.11.9
>> SrunEpilog              = (null)
>> SrunPortRange           = 0-0
>> SrunProlog              = (null)
>> StateSaveLocation       = /data/slurm/spool
>> SuspendExcNodes         = (null)
>> SuspendExcParts         = (null)
>> SuspendProgram          = (null)
>> SuspendRate             = 60 nodes/min
>> SuspendTime             = NONE
>> SuspendTimeout          = 30 sec
>> SwitchType              = switch/none
>> TaskEpilog              = (null)
>> TaskPlugin              = task/affinity,task/cgroup
>> TaskPluginParam         = (null type)
>> TaskProlog              = (null)
>> TCPTimeout              = 2 sec
>> TmpFS                   = /tmp
>> TopologyParam           = (null)
>> TopologyPlugin          = topology/none
>> TrackWCKey              = No
>> TreeWidth               = 255
>> UsePam                  = No
>> UnkillableStepProgram   = (null)
>> UnkillableStepTimeout   = 60 sec
>> VSizeFactor             = 0 percent
>> WaitTime                = 0 sec
>> X11Parameters           = (null)
>>
>> Cgroup Support Configuration:
>> AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
>> AllowedKmemSpace        = (null)
>> AllowedRAMSpace         = 100.0%
>> AllowedSwapSpace        = 0.0%
>> CgroupAutomount         = yes
>> CgroupMountpoint        = /cgroup
>> ConstrainCores          = yes
>> ConstrainDevices        = no
>> ConstrainKmemSpace      = no
>> ConstrainRAMSpace       = yes
>> ConstrainSwapSpace      = yes
>> MaxKmemPercent          = 100.0%
>> MaxRAMPercent           = 100.0%
>> MaxSwapPercent          = 100.0%
>> MemorySwappiness        = (null)
>> MinKmemSpace            = 30 MB
>> MinRAMSpace             = 30 MB
>> TaskAffinity            = no
>>
>> Slurmctld(primary) at clctl1 is UP
>>
>>
>> Please let me know if any other information is needed to understand this.
>> Any help is appreciated.
>>
>> Thanks,
>> -rob
>>
>>
>
> End of slurm-users Digest, Vol 66, Issue 6
> ******************************************



