[slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

Williams, Jenny Avis jennyw at email.unc.edu
Tue Jul 11 14:47:07 UTC 2023


Progress on getting slurmd to start under cgroupv2

Issue: slurmd 22.05.6 will not start when using cgroupv2

Expected result: even after a reboot, slurmd starts up without needing manual writes to files under /sys/fs/cgroup.

When started as a service, the error is:

# systemctl status slurmd
* slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           `-extendUnit.conf
   Active: failed (Result: exit-code) since Tue 2023-07-11 10:29:23 EDT; 2s ago
  Process: 11395 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
Main PID: 11395 (code=exited, status=1/FAILURE)

Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: Started Slurm node daemon.
Jul 11 10:29:23 g1803jles01.ll.unc.edu slurmd[11395]: slurmd: slurmd version 22.05.6 started
Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: slurmd.service: Failed with result 'exit-code'.

When started at the command line, the output is:

# slurmd -D -vvv 2>&1 |egrep error
slurmd: error: Controller cpuset is not enabled!
slurmd: error: Controller cpu is not enabled!
slurmd: error: Controller cpuset is not enabled!
slurmd: error: Controller cpu is not enabled!
slurmd: error: Controller cpuset is not enabled!
slurmd: error: Controller cpu is not enabled!
slurmd: error: Controller cpuset is not enabled!
slurmd: error: Controller cpu is not enabled!
slurmd: error: cpu cgroup controller is not available.
slurmd: error: There's an issue initializing memory or cpu controller
slurmd: error: Couldn't load specified plugin name for jobacct_gather/cgroup: Plugin init() callback failed
slurmd: error: cannot create jobacct_gather context for jobacct_gather/cgroup
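
The errors above indicate that the cpu and cpuset controllers have not been delegated down to the cgroup slurmd runs in. A quick way to confirm this (standard cgroup v2 files, nothing Slurm-specific):

# Controllers the kernel offers at the root of the unified hierarchy
cat /sys/fs/cgroup/cgroup.controllers
# Controllers actually enabled for children of the root
cat /sys/fs/cgroup/cgroup.subtree_control
# Controllers enabled for service units under system.slice
cat /sys/fs/cgroup/system.slice/cgroup.subtree_control

If cpu and cpuset are missing from the subtree_control files, slurmd's cgroup/v2 plugin cannot initialize those controllers and fails as shown above.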


Steps to mitigate the issue:

While the following steps do not solve the underlying issue, they do get the system into a state where slurmd will start, at least until the next reboot.  Reinstalling slurm-slurmd is a one-time step to ensure that local service modifications are out of the picture.  Currently, even after a reboot, the cgroup echo steps are necessary at a minimum.

#!/bin/bash
# One-time: reinstall the slurmd package to rule out local service-file edits
/usr/bin/dnf -y reinstall slurm-slurmd
systemctl daemon-reload
# Kill any leftover "slurmstepd infinity" placeholder process
/usr/bin/pkill -f '/usr/sbin/slurmstepd infinity'
systemctl enable slurmd
# Stop the site-local dcismeng service, enable the cpu/cpuset/memory controllers
# for the root and system.slice subtrees, then start slurmd
systemctl stop dcismeng.service && \
/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/cgroup.subtree_control && \
/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/system.slice/cgroup.subtree_control && \
systemctl start slurmd && \
 echo 'run this: systemctl start dcismeng'
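
A possible way to make the controller delegation survive a reboot, rather than echoing into the cgroup files by hand, is to let systemd delegate the controllers to slurmd itself. A minimal sketch, assuming a drop-in in the existing /etc/systemd/system/slurmd.service.d directory (the file name is arbitrary, and this has not been verified on this system):

# Hypothetical drop-in: Delegate=Yes asks systemd to enable its supported
# controllers for slurmd's cgroup subtree and hand that subtree over to slurmd.
cat > /etc/systemd/system/slurmd.service.d/delegate.conf <<'EOF'
[Service]
Delegate=Yes
EOF
systemctl daemon-reload
systemctl restart slurmd

Whether this fully replaces the echo steps here would need testing across a reboot.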


Environment:


# scontrol show config
Configuration data as of 2023-07-11T10:39:48
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos,safe
AccountingStorageHost   = m1006
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreFlags    = (null)
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes            = (null)
AuthAltParameters       = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BcastExclude            = /lib,/usr/lib,/lib64,/usr/lib64
BcastParameters         = (null)
BOOT_TIME               = 2023-07-11T10:04:31
BurstBufferType         = (null)
CliFilterPlugins        = (null)
ClusterName             = ASlurmCluster
CommunicationParameters = (null)
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = OnDemand,Performance,UserSpace
CredType                = cred/munge
DebugFlags              = (null)
DefMemPerNode           = UNLIMITED
DependencyParameters    = kill_invalid_depend
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = ANY
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 1
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = gpu
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 65533 sec
InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency  = task=15
JobAcctGatherType       = jobacct_gather/cgroup
JobAcctGatherParams     = (null)
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = lua
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Licenses                = mplus:1,nonmem:32
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 90001
MaxDBDMsgs              = 701360
MaxJobCount             = 350000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxNodeCount            = 340
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 60 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
NEXT_JOB_ID             = 12286313
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = (null)
PowerParameters         = (null)
PowerPlugin             =
PreemptMode             = OFF
PreemptType             = preempt/none
PreemptExemptTime       = 00:00:00
PrEpParameters          = (null)
PrEpPlugins             = prep/script
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 14-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = SMALL_RELATIVE_TO_TIME,CALCULATE_RUNNING,MAX_TRES
PriorityMaxAge          = 60-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 10000
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 10000
PriorityWeightJobSize   = 1000
PriorityWeightPartition = 1000
PriorityWeightQOS       = 1000
PriorityWeightTRES      = CPU=1000,Mem=4000,GRES/gpu=3000
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = Alloc,Contain,X11
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = /usr/sbin/reboot
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 2
RoutePlugin             = route/default
SchedulerParameters     = batch_sched_delay=10,bf_continue,bf_max_job_part=1000,bf_max_job_test=10000,bf_max_job_user=100,bf_resolution=300,bf_window=10080,bf_yield_interval=1000000,default_queue_depth=1000,partition_job_depth=600,sched_min_interval=20000000,defer,max_rpc_cnt=80
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
ScronParameters         = (null)
SelectType              = select/cons_tres
SelectTypeParameters    = CR_CPU_MEMORY
SlurmUser               = slurm(47)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = ASlurmCluster-sched(x.x.x.x)
SlurmctldLogFile        = /data/slurm/slurmctld.log
SlurmctldPort           = 6820-6824
SlurmctldSyslogDebug    = (null)
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 6000 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurm/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = (null)
SlurmdTimeout           = 600 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 22.05.6
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /data/slurm/slurmctld
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = INFINITE
SuspendTimeout          = 30 sec
SwitchParameters        = (null)
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = cgroup,affinity
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = No
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 600 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = home_xauthority

Cgroup Support Configuration:
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 1.0%
CgroupAutomount         = yes
CgroupMountpoint        = /sys/fs/cgroup
CgroupPlugin            = cgroup/v2
ConstrainCores          = yes
ConstrainDevices        = yes
ConstrainKmemSpace      = no
ConstrainRAMSpace       = yes
ConstrainSwapSpace      = yes
IgnoreSystemd           = no
IgnoreSystemdOnFailure  = no
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB

Slurmctld(primary) at ASlurmCluster-sched is UP
