The following log snippet shows a job that will put the node in DRAIN state with reason "batch job complete failure":

(...)
[2022-04-05T02:59:57.980] [4302857.batch] debug3: Couldn't find sym 'slurm_spank_slurmd_exit' in the plugin
[2022-04-05T02:59:57.980] [4302857.batch] debug: spank: /etc/slurm/plugstack.conf:35: Loaded plugin use-env.so
[2022-04-05T02:59:57.980] [4302857.batch] debug: SPANK: appending plugin option "use-env"
[2022-04-05T02:59:57.980] [4302857.batch] debug2: spank: private-tmpdir.so: init = 0
[2022-04-05T02:59:57.980] [4302857.batch] debug2: spank: use-env.so: init = 0
[2022-04-05T02:59:57.981] [4302857.batch] debug: private-tmpdir: mounting: /scratch/slurm.4302857.0/tmp /tmp
[2022-04-05T02:59:57.981] [4302857.batch] debug2: spank: private-tmpdir.so: init_post_opt = 0
[2022-04-05T02:59:57.981] [4302857.batch] debug2: After call to spank_init()
[2022-04-05T02:59:57.981] [4302857.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'cgroup.clone_children' set to '0' for '/sys/fs/cgroup/cpuset/slurm'
[2022-04-05T02:59:57.983] [4302857.batch] error: common_cgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/cpuset/slurm/uid_43197/job_4302857' : No such file or directory
[2022-04-05T02:59:57.983] [4302857.batch] error: _cpuset_create: unable to instantiate job 4302857 cgroup
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory/slurm/uid_43197'
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory/slurm/uid_43197/job_4302857'
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory/slurm/uid_43197/job_4302857/step_batch'
[2022-04-05T02:59:58.008] [4302857.batch] task/cgroup: _memcg_initialize: job: alloc=3072MB mem.limit=3072MB memsw.limit=3072MB
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1: common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter 'memory.limit_in_bytes' set to '3221225472' for '/sys/fs/cgroup/memory/slurm/uid_43197/job_4302857'
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1: common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter 'memory.soft_limit_in_bytes' set to '3221225472' for '/sys/fs/cgroup/memory/slurm/uid_43197/job_4302857'
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1: common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter 'memory.memsw.limit_in_bytes' set to '3221225472' for '/sys/fs/cgroup/memory/slurm/uid_43197/job_4302857'
[2022-04-05T02:59:58.008] [4302857.batch] task/cgroup: _memcg_initialize: step: alloc=3072MB mem.limit=3072MB memsw.limit=3072MB
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1: common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter 'memory.limit_in_bytes' set to '3221225472' for '/sys/fs/cgroup/memory/slurm/uid_43197/job_4302857/step_batch'
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1: common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter 'memory.soft_limit_in_bytes' set to '3221225472' for '/sys/fs/cgroup/memory/slurm/uid_43197/job_4302857/step_batch'
[2022-04-05T02:59:58.008] [4302857.batch] debug3: cgroup/v1: common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter 'memory.memsw.limit_in_bytes' set to '3221225472' for '/sys/fs/cgroup/memory/slurm/uid_43197/job_4302857/step_batch'
[2022-04-05T02:59:58.009] [4302857.batch] debug: cgroup/v1: _oom_event_monitor: started.
[2022-04-05T02:59:58.009] [4302857.batch] debug: task_g_pre_setuid: task/cgroup: Unspecified error
[2022-04-05T02:59:58.009] [4302857.batch] error: Failed to invoke task plugins: one of task_p_pre_setuid functions returned error
[2022-04-05T02:59:58.009] [4302857.batch] debug: _fork_all_tasks failed
[2022-04-05T02:59:58.009] [4302857.batch] debug2: step_terminate_monitor will run for 120 secs
[2022-04-05T02:59:58.009] [4302857.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'freezer.state' set to 'THAWED' for '/sys/fs/cgroup/freezer/slurm/uid_43197/job_4302857/step_batch'
[2022-04-05T02:59:58.009] [4302857.batch] debug: signaling condition
[2022-04-05T02:59:58.009] [4302857.batch] debug2: step_terminate_monitor is stopping
[2022-04-05T02:59:58.009] [4302857.batch] debug2: _monitor exit code: 0
[2022-04-05T02:59:58.021] [4302857.batch] debug3: cgroup/v1: _oom_event_monitor: res: 1
[2022-04-05T02:59:58.021] [4302857.batch] debug: cgroup/v1: _oom_event_monitor: oom-kill event count: 1
[2022-04-05T02:59:58.040] [4302857.batch] error: called without a previous init. This shouldn't happen!
[2022-04-05T02:59:58.040] [4302857.batch] debug: jobacct_gather/cgroup: fini: Job accounting gather cgroup plugin unloaded
[2022-04-05T02:59:58.040] [4302857.batch] error: called without a previous init. This shouldn't happen!
[2022-04-05T02:59:58.041] [4302857.batch] error: called without a previous init. This shouldn't happen!
[2022-04-05T02:59:58.041] [4302857.batch] debug: task/cgroup: fini: Tasks containment cgroup plugin unloaded
[2022-04-05T02:59:58.041] [4302857.batch] debug2: Before call to spank_fini()
[2022-04-05T02:59:58.041] [4302857.batch] debug2: spank: private-tmpdir.so: exit = 0
[2022-04-05T02:59:58.041] [4302857.batch] debug2: spank: use-env.so: exit = 0
[2022-04-05T02:59:58.041] [4302857.batch] debug2: After call to spank_fini()
[2022-04-05T02:59:58.041] [4302857.batch] error: job_manager: exiting abnormally: Slurmd could not execve job
[2022-04-05T02:59:58.041] [4302857.batch] job 4302857 completed with slurm_rc = 4020, job_rc = 0
[2022-04-05T02:59:58.041] [4302857.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status:0
[2022-04-05T02:59:59.405] [4302857.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:59:59.405] [4302857.batch] debug2: false, shutdown
[2022-04-05T02:59:59.406] [4302857.batch] debug: Message thread exited
[2022-04-05T02:59:59.406] [4302857.batch] done with job
(...)
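The first error is the one that matters: the attempt to create '/sys/fs/cgroup/cpuset/slurm/uid_43197/job_4302857' failed with "No such file or directory" (ENOENT), which for a mkdir(2) on a cgroupfs path means a parent component was missing at that moment, presumably the uid_43197 directory, since '/sys/fs/cgroup/cpuset/slurm' had just been written to a few milliseconds earlier. One plausible scenario (an assumption, not something the log proves) is that another job of the same user finished concurrently and its cleanup removed the uid-level cgroup. Everything after that, the task_g_pre_setuid "Unspecified error", the failed _fork_all_tasks, and the final slurm_rc = 4020 "Slurmd could not execve job" that drains the node, looks like fallout from this first failure. A minimal Python sketch of the failing operation, with paths copied from the log; this only illustrates the kernel behaviour, it is not Slurm's code:

import errno
import os

# Paths copied from the failing log entry; on a node they sit under the
# cgroup v1 cpuset mountpoint.
parent_cg = "/sys/fs/cgroup/cpuset/slurm/uid_43197"
job_cg = os.path.join(parent_cg, "job_4302857")

try:
    # mkdir(2) on a cgroup v1 hierarchy instantiates the cgroup. If any
    # parent component is missing it fails with ENOENT, whose strerror()
    # text is exactly what common_cgroup_instantiate logged.
    os.mkdir(job_cg, 0o755)
except OSError as e:
    if e.errno == errno.ENOENT:
        print(f"unable to create cgroup '{job_cg}' : {e.strerror}")
        print(f"parent exists: {os.path.isdir(parent_cg)}")
    else:
        raise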
As a reference, the following log snippet shows a job that starts without any issue:

(...)
[2022-04-05T02:43:28.220] [4304882.batch] debug3: Couldn't find sym 'slurm_spank_slurmd_exit' in the plugin
[2022-04-05T02:43:28.220] [4304882.batch] debug: spank: /etc/slurm/plugstack.conf:35: Loaded plugin use-env.so
[2022-04-05T02:43:28.220] [4304882.batch] debug: SPANK: appending plugin option "use-env"
[2022-04-05T02:43:28.220] [4304882.batch] debug2: spank: private-tmpdir.so: init = 0
[2022-04-05T02:43:28.222] [4304882.batch] debug2: spank: use-env.so: init = 0
[2022-04-05T02:43:28.224] [4304882.batch] debug: private-tmpdir: mounting: /scratch/slurm.4304882.0/tmp /tmp
[2022-04-05T02:43:28.224] [4304882.batch] debug2: spank: private-tmpdir.so: init_post_opt = 0
[2022-04-05T02:43:28.224] [4304882.batch] debug2: After call to spank_init()
[2022-04-05T02:43:28.224] [4304882.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'cgroup.clone_children' set to '0' for '/sys/fs/cgroup/cpuset/slurm'
[2022-04-05T02:43:28.224] [4304882.batch] debug: task/cgroup: task_cgroup_cpuset_create: job abstract cores are '20-21'
[2022-04-05T02:43:28.224] [4304882.batch] debug: task/cgroup: task_cgroup_cpuset_create: step abstract cores are '20-21'
[2022-04-05T02:43:28.224] [4304882.batch] debug: task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '10,42'
[2022-04-05T02:43:28.224] [4304882.batch] debug: task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '10,42'
[2022-04-05T02:43:28.224] [4304882.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.cpus' set to '10,42,0-63' for '/sys/fs/cgroup/cpuset/slurm/uid_930'
[2022-04-05T02:43:28.224] [4304882.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.mems' set to '0-2,4-6' for '/sys/fs/cgroup/cpuset/slurm/uid_930'
[2022-04-05T02:43:28.226] [4304882.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.cpus' set to '10,42' for '/sys/fs/cgroup/cpuset/slurm/uid_930/job_4304882'
[2022-04-05T02:43:28.226] [4304882.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.mems' set to '0-2,4-6' for '/sys/fs/cgroup/cpuset/slurm/uid_930/job_4304882'
[2022-04-05T02:43:28.227] [4304882.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.cpus' set to '10,42' for '/sys/fs/cgroup/cpuset/slurm/uid_930/job_4304882/step_batch'
[2022-04-05T02:43:28.227] [4304882.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'cpuset.mems' set to '0-2,4-6' for '/sys/fs/cgroup/cpuset/slurm/uid_930/job_4304882/step_batch'
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory/slurm/uid_930'
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory/slurm/uid_930/job_4304882'
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1: common_cgroup_set_param: common_cgroup_set_param: parameter 'memory.use_hierarchy' set to '1' for '/sys/fs/cgroup/memory/slurm/uid_930/job_4304882/step_batch'
[2022-04-05T02:43:28.241] [4304882.batch] task/cgroup: _memcg_initialize: job: alloc=6144MB mem.limit=6144MB memsw.limit=6144MB
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1: common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter 'memory.limit_in_bytes' set to '6442450944' for '/sys/fs/cgroup/memory/slurm/uid_930/job_4304882'
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1: common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter 'memory.soft_limit_in_bytes' set to '6442450944' for '/sys/fs/cgroup/memory/slurm/uid_930/job_4304882'
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1: common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter 'memory.memsw.limit_in_bytes' set to '6442450944' for '/sys/fs/cgroup/memory/slurm/uid_930/job_4304882'
[2022-04-05T02:43:28.241] [4304882.batch] task/cgroup: _memcg_initialize: step: alloc=6144MB mem.limit=6144MB memsw.limit=6144MB
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1: common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter 'memory.limit_in_bytes' set to '6442450944' for '/sys/fs/cgroup/memory/slurm/uid_930/job_4304882/step_batch'
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1: common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter 'memory.soft_limit_in_bytes' set to '6442450944' for '/sys/fs/cgroup/memory/slurm/uid_930/job_4304882/step_batch'
[2022-04-05T02:43:28.241] [4304882.batch] debug3: cgroup/v1: common_cgroup_set_uint64_param: common_cgroup_set_uint64_param: parameter 'memory.memsw.limit_in_bytes' set to '6442450944' for '/sys/fs/cgroup/memory/slurm/uid_930/job_4304882/step_batch'
[2022-04-05T02:43:28.242] [4304882.batch] debug: cgroup/v1: _oom_event_monitor: started.
[2022-04-05T02:43:28.242] [4304882.batch] debug2: hwloc_topology_load
[2022-04-05T02:43:28.286] [4304882.batch] debug2: hwloc_topology_export_xml
[2022-04-05T02:43:28.293] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.293] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.296] [4304882.batch] debug2: Entering _setup_normal_io
[2022-04-05T02:43:28.296] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.296] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.296] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.309] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.309] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.309] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.309] [4304882.batch] debug2: Leaving _setup_normal_io
[2022-04-05T02:43:28.309] [4304882.batch] debug levels are stderr='error', logfile='debug3', syslog='quiet'
[2022-04-05T02:43:28.309] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.309] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.309] [4304882.batch] debug3: Called _msg_socket_readable
[2022-04-05T02:43:28.309] [4304882.batch] starting 1 tasks
[2022-04-05T02:43:28.310] [4304882.batch] task 0 (29705) started 2022-04-05T02:43:28
(...)
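One detail worth ruling out when comparing the two snippets: the memory numbers themselves are consistent in both jobs, so the limits are not the problem. A quick check in Python, with the values copied from the logs above; note that memsw.limit equals mem.limit in both jobs, which matches the AllowedSwapSpace = 0.0% shown in the configuration below:

# alloc in MB (mebibytes) vs. the memory.limit_in_bytes values logged
checks = {
    4302857: (3072, 3221225472),
    4304882: (6144, 6442450944),
}
for jobid, (alloc_mb, limit_bytes) in checks.items():
    # the logged byte values are exactly alloc_mb * 2**20
    assert alloc_mb * 2**20 == limit_bytes, jobid
print("memory limits match the allocations")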
"scontrol show config" output:

Configuration data as of 2022-04-14T13:57:54
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost = ...
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu,gres/gpu:k80,gres/gpu:v100
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreFlags = job_comment
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes = auth/jwt
AuthAltParameters = jwt_key=/etc/slurm/jwt_hs256.key
AuthInfo = (null)
AuthType = auth/munge
BatchStartTimeout = 10 sec
BcastExclude = /lib,/usr/lib,/lib64,/usr/lib64
BcastParameters = (null)
BOOT_TIME = 2022-04-11T14:26:48
BurstBufferType = (null)
CliFilterPlugins = (null)
ClusterName = ccslurmlocal
CommunicationParameters = (null)
CompleteWait = 0 sec
CoreSpecPlugin = core_spec/none
CpuFreqDef = Unknown
CpuFreqGovernors = OnDemand,Performance,UserSpace
CredType = cred/munge
DebugFlags = CPU_Bind,Gres
DefMemPerNode = UNLIMITED
DependencyParameters = (null)
DisableRootJobs = Yes
EioTimeout = 60
EnforcePartLimits = NO
Epilog = /etc/slurm/epilog.sh
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu
GpuFreqDef = high,memory=high
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 0 sec
InteractiveStepOptions = --interactive
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/cgroup
JobAcctGatherParams = (null)
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = lua
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 0
KillWait = 30 sec
LaunchParameters = use_interactive_step
LaunchType = launch/slurm
Licenses = ...
LogTimeFormat = iso8601_ms
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 1000001
MaxDBDMsgs = 20016
MaxJobCount = 40000
MaxJobId = 67043328
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MessageTimeout = 30 sec
MinJobAge = 60 sec
MpiDefault = none
MpiParams = (null)
NEXT_JOB_ID = 5670091
NodeFeaturesPlugins = (null)
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = (null)
PowerParameters = (null)
PowerPlugin =
PreemptMode = OFF
PreemptType = preempt/none
PreemptExemptTime = 00:00:00
PrEpParameters = (null)
PrEpPlugins = prep/script
PriorityParameters = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife = 7-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 4-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 100
PriorityWeightAssoc = 0
PriorityWeightFairShare = 1000
PriorityWeightJobSize = 0
PriorityWeightPartition = 0
PriorityWeightQOS = 10
PriorityWeightTRES = (null)
PrivateData = accounts,events,jobs,reservations,usage,users
ProctrackType = proctrack/cgroup
Prolog = /etc/slurm/prolog.sh
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = Alloc
PropagatePrioProcess = 0
PropagateResourceLimits = NONE
PropagateResourceLimitsExcept = (null)
RebootProgram = (null)
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeFailProgram = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 1
RoutePlugin = route/default
SchedulerParameters = pack_serial_at_end,max_rpc_cnt=40
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
ScronParameters = (null)
SelectType = select/cons_tres
SelectTypeParameters = CR_CPU_MEMORY
SlurmUser = slurm(9912)
SlurmctldAddr = (null)
SlurmctldDebug = info
SlurmctldHost[0] = ...01
SlurmctldHost[1] = ...02
SlurmctldLogFile = /var/log/slurm/slurmctld.log
SlurmctldPort = 6817
SlurmctldSyslogDebug = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg = (null)
SlurmctldTimeout = 120 sec
SlurmctldParameters = (null)
SlurmdDebug = info
SlurmdLogFile = /var/log/slurm/slurmd.log
SlurmdParameters = (null)
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdSyslogDebug = unknown
SlurmdTimeout = 500 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 21.08.6
SrunEpilog = (null)
SrunPortRange = 40001-49999
SrunProlog = (null)
StateSaveLocation = /pbs/slurm/prod21.08.6/var/spool/slurmctld
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = INFINITE
SuspendTimeout = 30 sec
SwitchParameters = (null)
SwitchType = switch/none
TaskEpilog = /etc/slurm/taskepilog.sh
TaskPlugin = task/cgroup,task/affinity
TaskPluginParam = (null type)
TaskProlog = /etc/slurm/taskprolog.sh
TCPTimeout = 6 sec
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = No
TreeWidth = 50
UsePam = Yes
UnkillableStepProgram = (null)
UnkillableStepTimeout = 120 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
X11Parameters = (null)

Cgroup Support Configuration:
AllowedDevicesFile = (null)
AllowedKmemSpace = (null)
AllowedRAMSpace = 100.0%
AllowedSwapSpace = 0.0%
CgroupAutomount = no
CgroupMountpoint = (null)
CgroupPlugin = (null)
ConstrainCores = no
ConstrainDevices = no
ConstrainKmemSpace = no
ConstrainRAMSpace = no
ConstrainSwapSpace = no
MaxKmemPercent = 100.0%
MaxRAMPercent = 100.0%
MaxSwapPercent = 100.0%
MemorySwappiness = (null)
MinKmemSpace = 30 MB
MinRAMSpace = 30 MB
TaskAffinity = no

Slurmctld(primary) at ...01 is UP
Slurmctld(backup) at ...02 is UP
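For reference, the one-"Key = Value"-per-line format above is easy to consume programmatically when comparing configuration between nodes or across restarts. A small helper sketch (hypothetical, not part of Slurm; it only assumes the standard scontrol output format shown above):

import subprocess

def scontrol_config():
    """Parse `scontrol show config` output into a {parameter: value} dict.

    Section headers and status lines without ' = ' are skipped.
    """
    out = subprocess.run(["scontrol", "show", "config"],
                         capture_output=True, text=True, check=True).stdout
    cfg = {}
    for line in out.splitlines():
        if " = " in line:
            key, _, value = line.partition(" = ")
            cfg[key.strip()] = value.strip()
    return cfg

# Example: print just the cgroup-related settings relevant to this report.
if __name__ == "__main__":
    cfg = scontrol_config()
    for key in sorted(cfg):
        if key.startswith(("Cgroup", "Constrain", "Allowed")):
            print(key, "=", cfg[key])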