[slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

Williams, Jenny Avis jennyw at email.unc.edu
Wed Jul 12 14:50:24 UTC 2023


The systems have only cgroup/v2 enabled:
	# mount |egrep cgroup
	cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
Distribution and kernel
	RedHat 8.7 
	4.18.0-348.2.1.el8_5.x86_64
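
For completeness, the controllers the kernel offers on this hierarchy, and the
ones enabled for its children, can be listed with, for example:
	# cat /sys/fs/cgroup/cgroup.controllers
	# cat /sys/fs/cgroup/cgroup.subtree_control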



-----Original Message-----
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Hermann Schwärzler
Sent: Wednesday, July 12, 2023 4:36 AM
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

Hi Jenny,

I *guess* you have a system that has both cgroup/v1 and cgroup/v2 enabled.

Which Linux distribution are you using? And which kernel version?
What is the output of
   mount | grep cgroup
What if you do not restrict the cgroup-version Slurm can use to
cgroup/v2 but omit "CgroupPlugin=..." from your cgroup.conf?

Regards,
Hermann
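
For illustration, the cgroup.conf quoted below would then look something like
this, with the version restriction simply left out so that Slurm selects the
plugin itself:

   CgroupAutomount=yes
   ConstrainCores=yes
   ConstrainRAMSpace=yes
   AllowedSwapSpace=1
   ConstrainSwapSpace=yes
   ConstrainDevices=yes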

On 7/11/23 19:41, Williams, Jenny Avis wrote:
> Additional configuration information -- /etc/slurm/cgroup.conf
> 
> CgroupAutomount=yes
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> CgroupPlugin=cgroup/v2
> AllowedSwapSpace=1
> ConstrainSwapSpace=yes
> ConstrainDevices=yes
> 
> From: Williams, Jenny Avis
> Sent: Tuesday, July 11, 2023 10:47 AM
> To: slurm-users at schedmd.com
> Subject: cgroupv2 + slurmd - external cgroup changes needed to get daemon to start
> 
> Progress on getting slurmd to start under cgroupv2
> 
> Issue: slurmd 22.05.6 will not start when using cgroupv2
> 
> Expected result: even after a reboot, slurmd starts without needing manual
> additions to the files under /sys/fs/cgroup.
> 
> When started as a service, the error is:
> 
> # systemctl status slurmd
> 
> * slurmd.service - Slurm node daemon
>     Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>    Drop-In: /etc/systemd/system/slurmd.service.d
>             `-extendUnit.conf
>     Active: failed (Result: exit-code) since Tue 2023-07-11 10:29:23 EDT; 2s ago
>    Process: 11395 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
>   Main PID: 11395 (code=exited, status=1/FAILURE)
> 
> Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: Started Slurm node daemon.
> Jul 11 10:29:23 g1803jles01.ll.unc.edu slurmd[11395]: slurmd: slurmd version 22.05.6 started
> Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
> Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: slurmd.service: Failed with result 'exit-code'.
> 
> When started at the command line, the output is:
> 
> # slurmd -D -vvv 2>&1 |egrep error
> slurmd: error: Controller cpuset is not enabled!
> slurmd: error: Controller cpu is not enabled!
> slurmd: error: Controller cpuset is not enabled!
> slurmd: error: Controller cpu is not enabled!
> slurmd: error: Controller cpuset is not enabled!
> slurmd: error: Controller cpu is not enabled!
> slurmd: error: Controller cpuset is not enabled!
> slurmd: error: Controller cpu is not enabled!
> slurmd: error: cpu cgroup controller is not available.
> slurmd: error: There's an issue initializing memory or cpu controller
> slurmd: error: Couldn't load specified plugin name for jobacct_gather/cgroup: Plugin init() callback failed
> slurmd: error: cannot create jobacct_gather context for jobacct_gather/cgroup
> 
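> The "Controller cpuset/cpu is not enabled" messages correspond to those
> controllers missing from cgroup.subtree_control along slurmd's cgroup path;
> which controllers are currently enabled at each level can be checked with,
> for example:
> 
> # cat /sys/fs/cgroup/cgroup.subtree_control
> # cat /sys/fs/cgroup/system.slice/cgroup.subtree_control
> 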
> Steps to mitigate the issue:
> 
> While the following steps do not solve the issue, they do get the
> system into a state where slurmd will start, at least until the next
> reboot. Reinstalling slurm-slurmd is a one-time step to ensure that
> local service modifications are out of the picture. Currently, even
> after a reboot, the cgroup echo steps are necessary at a minimum.
> 
> #!/bin/bash
> 
> /usr/bin/dnf -y reinstall slurm-slurmd
> systemctl daemon-reload
> /usr/bin/pkill -f '/usr/sbin/slurmstepd infinity'
> systemctl enable slurmd
> systemctl stop dcismeng.service && \
>   /usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/cgroup.subtree_control && \
>   /usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/system.slice/cgroup.subtree_control && \
>   systemctl start slurmd && \
>   echo 'run this: systemctl start dcismeng'
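> 
> One way that might make the controller delegation survive a reboot (a
> sketch only, not verified here, and assuming the packaged
> slurmd.service does not already request it) is to have systemd
> delegate the controllers to slurmd via a unit drop-in, e.g. in the
> existing extendUnit.conf or a new, hypothetically named
> /etc/systemd/system/slurmd.service.d/delegate.conf:
> 
> [Service]
> # Ask systemd to enable and hand over the cgroup controllers
> # (cpu, cpuset, memory, ...) for slurmd's cgroup subtree.
> Delegate=yes
> 
> followed by "systemctl daemon-reload" and a restart of slurmd.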
> 
> Environment:
> 
> # scontrol show config
> 
> Configuration data as of 2023-07-11T10:39:48
> 
> AccountingStorageBackupHost = (null)
> 
> AccountingStorageEnforce = associations,limits,qos,safe
> 
> AccountingStorageHost   = m1006
> 
> AccountingStorageExternalHost = (null)
> 
> AccountingStorageParameters = (null)
> 
> AccountingStoragePort   = 6819
> 
> AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
> 
> AccountingStorageType   = accounting_storage/slurmdbd
> 
> AccountingStorageUser   = N/A
> 
> AccountingStoreFlags    = (null)
> 
> AcctGatherEnergyType    = acct_gather_energy/none
> 
> AcctGatherFilesystemType = acct_gather_filesystem/none
> 
> AcctGatherInterconnectType = acct_gather_interconnect/none
> 
> AcctGatherNodeFreq      = 0 sec
> 
> AcctGatherProfileType   = acct_gather_profile/none
> 
> AllowSpecResourcesUsage = No
> 
> AuthAltTypes            = (null)
> 
> AuthAltParameters       = (null)
> 
> AuthInfo                = (null)
> 
> AuthType                = auth/munge
> 
> BatchStartTimeout       = 10 sec
> 
> BcastExclude            = /lib,/usr/lib,/lib64,/usr/lib64
> 
> BcastParameters         = (null)
> 
> BOOT_TIME               = 2023-07-11T10:04:31
> 
> BurstBufferType         = (null)
> 
> CliFilterPlugins        = (null)
> 
> ClusterName             = ASlurmCluster
> 
> CommunicationParameters = (null)
> 
> CompleteWait            = 0 sec
> 
> CoreSpecPlugin          = core_spec/none
> 
> CpuFreqDef              = Unknown
> 
> CpuFreqGovernors        = OnDemand,Performance,UserSpace
> 
> CredType                = cred/munge
> 
> DebugFlags              = (null)
> 
> DefMemPerNode           = UNLIMITED
> 
> DependencyParameters    = kill_invalid_depend
> 
> DisableRootJobs         = No
> 
> EioTimeout              = 60
> 
> EnforcePartLimits       = ANY
> 
> Epilog                  = (null)
> 
> EpilogMsgTime           = 2000 usec
> 
> EpilogSlurmctld         = (null)
> 
> ExtSensorsType          = ext_sensors/none
> 
> ExtSensorsFreq          = 0 sec
> 
> FairShareDampeningFactor = 1
> 
> FederationParameters    = (null)
> 
> FirstJobId              = 1
> 
> GetEnvTimeout           = 2 sec
> 
> GresTypes               = gpu
> 
> GpuFreqDef              = high,memory=high
> 
> GroupUpdateForce        = 1
> 
> GroupUpdateTime         = 600 sec
> 
> HASH_VAL                = Match
> 
> HealthCheckInterval     = 0 sec
> 
> HealthCheckNodeState    = ANY
> 
> HealthCheckProgram      = (null)
> 
> InactiveLimit           = 65533 sec
> 
> InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL
> 
> JobAcctGatherFrequency  = task=15
> 
> JobAcctGatherType       = jobacct_gather/cgroup
> 
> JobAcctGatherParams     = (null)
> 
> JobCompHost             = localhost
> 
> JobCompLoc              = /var/log/slurm_jobcomp.log
> 
> JobCompPort             = 0
> 
> JobCompType             = jobcomp/none
> 
> JobCompUser             = root
> 
> JobContainerType        = job_container/none
> 
> JobCredentialPrivateKey = (null)
> 
> JobCredentialPublicCertificate = (null)
> 
> JobDefaults             = (null)
> 
> JobFileAppend           = 0
> 
> JobRequeue              = 1
> 
> JobSubmitPlugins        = lua
> 
> KillOnBadExit           = 0
> 
> KillWait                = 30 sec
> 
> LaunchParameters        = (null)
> 
> LaunchType              = launch/slurm
> 
> Licenses                = mplus:1,nonmem:32
> 
> LogTimeFormat           = iso8601_ms
> 
> MailDomain              = (null)
> 
> MailProg                = /bin/mail
> 
> MaxArraySize            = 90001
> 
> MaxDBDMsgs              = 701360
> 
> MaxJobCount             = 350000
> 
> MaxJobId                = 67043328
> 
> MaxMemPerNode           = UNLIMITED
> 
> MaxNodeCount            = 340
> 
> MaxStepCount            = 40000
> 
> MaxTasksPerNode         = 512
> 
> MCSPlugin               = mcs/none
> 
> MCSParameters           = (null)
> 
> MessageTimeout          = 60 sec
> 
> MinJobAge               = 300 sec
> 
> MpiDefault              = none
> 
> MpiParams               = (null)
> 
> NEXT_JOB_ID             = 12286313
> 
> NodeFeaturesPlugins     = (null)
> 
> OverTimeLimit           = 0 min
> 
> PluginDir               = /usr/lib64/slurm
> 
> PlugStackConfig         = (null)
> 
> PowerParameters         = (null)
> 
> PowerPlugin             =
> 
> PreemptMode             = OFF
> 
> PreemptType             = preempt/none
> 
> PreemptExemptTime       = 00:00:00
> 
> PrEpParameters          = (null)
> 
> PrEpPlugins             = prep/script
> 
> PriorityParameters      = (null)
> 
> PrioritySiteFactorParameters = (null)
> 
> PrioritySiteFactorPlugin = (null)
> 
> PriorityDecayHalfLife   = 14-00:00:00
> 
> PriorityCalcPeriod      = 00:05:00
> 
> PriorityFavorSmall      = No
> 
> PriorityFlags           = SMALL_RELATIVE_TO_TIME,CALCULATE_RUNNING,MAX_TRES
> 
> PriorityMaxAge          = 60-00:00:00
> 
> PriorityUsageResetPeriod = NONE
> 
> PriorityType            = priority/multifactor
> 
> PriorityWeightAge       = 10000
> 
> PriorityWeightAssoc     = 0
> 
> PriorityWeightFairShare = 10000
> 
> PriorityWeightJobSize   = 1000
> 
> PriorityWeightPartition = 1000
> 
> PriorityWeightQOS       = 1000
> 
> PriorityWeightTRES      = CPU=1000,Mem=4000,GRES/gpu=3000
> 
> PrivateData             = none
> 
> ProctrackType           = proctrack/cgroup
> 
> Prolog                  = (null)
> 
> PrologEpilogTimeout     = 65534
> 
> PrologSlurmctld         = (null)
> 
> PrologFlags             = Alloc,Contain,X11
> 
> PropagatePrioProcess    = 0
> 
> PropagateResourceLimits = ALL
> 
> PropagateResourceLimitsExcept = (null)
> 
> RebootProgram           = /usr/sbin/reboot
> 
> ReconfigFlags           = (null)
> 
> RequeueExit             = (null)
> 
> RequeueExitHold         = (null)
> 
> ResumeFailProgram       = (null)
> 
> ResumeProgram           = (null)
> 
> ResumeRate              = 300 nodes/min
> 
> ResumeTimeout           = 60 sec
> 
> ResvEpilog              = (null)
> 
> ResvOverRun             = 0 min
> 
> ResvProlog              = (null)
> 
> ReturnToService         = 2
> 
> RoutePlugin             = route/default
> 
> SchedulerParameters     = batch_sched_delay=10,bf_continue,bf_max_job_part=1000,bf_max_job_test=10000,bf_max_job_user=100,bf_resolution=300,bf_window=10080,bf_yield_interval=1000000,default_queue_depth=1000,partition_job_depth=600,sched_min_interval=20000000,defer,max_rpc_cnt=80
> 
> SchedulerTimeSlice      = 30 sec
> 
> SchedulerType           = sched/backfill
> 
> ScronParameters         = (null)
> 
> SelectType              = select/cons_tres
> 
> SelectTypeParameters    = CR_CPU_MEMORY
> 
> SlurmUser               = slurm(47)
> 
> SlurmctldAddr           = (null)
> 
> SlurmctldDebug          = info
> 
> SlurmctldHost[0]        = ASlurmCluster-sched(x.x.x.x)
> 
> SlurmctldLogFile        = /data/slurm/slurmctld.log
> 
> SlurmctldPort           = 6820-6824
> 
> SlurmctldSyslogDebug    = (null)
> 
> SlurmctldPrimaryOffProg = (null)
> 
> SlurmctldPrimaryOnProg  = (null)
> 
> SlurmctldTimeout        = 6000 sec
> 
> SlurmctldParameters     = (null)
> 
> SlurmdDebug             = info
> 
> SlurmdLogFile           = /var/log/slurm/slurmd.log
> 
> SlurmdParameters        = (null)
> 
> SlurmdPidFile           = /var/run/slurmd.pid
> 
> SlurmdPort              = 6818
> 
> SlurmdSpoolDir          = /var/spool/slurmd
> 
> SlurmdSyslogDebug       = (null)
> 
> SlurmdTimeout           = 600 sec
> 
> SlurmdUser              = root(0)
> 
> SlurmSchedLogFile       = (null)
> 
> SlurmSchedLogLevel      = 0
> 
> SlurmctldPidFile        = /var/run/slurmctld.pid
> 
> SlurmctldPlugstack      = (null)
> 
> SLURM_CONF              = /etc/slurm/slurm.conf
> 
> SLURM_VERSION           = 22.05.6
> 
> SrunEpilog              = (null)
> 
> SrunPortRange           = 0-0
> 
> SrunProlog              = (null)
> 
> StateSaveLocation       = /data/slurm/slurmctld
> 
> SuspendExcNodes         = (null)
> 
> SuspendExcParts         = (null)
> 
> SuspendProgram          = (null)
> 
> SuspendRate             = 60 nodes/min
> 
> SuspendTime             = INFINITE
> 
> SuspendTimeout          = 30 sec
> 
> SwitchParameters        = (null)
> 
> SwitchType              = switch/none
> 
> TaskEpilog              = (null)
> 
> TaskPlugin              = cgroup,affinity
> 
> TaskPluginParam         = (null type)
> 
> TaskProlog              = (null)
> 
> TCPTimeout              = 2 sec
> 
> TmpFS                   = /tmp
> 
> TopologyParam           = (null)
> 
> TopologyPlugin          = topology/none
> 
> TrackWCKey              = No
> 
> TreeWidth               = 50
> 
> UsePam                  = No
> 
> UnkillableStepProgram   = (null)
> 
> UnkillableStepTimeout   = 600 sec
> 
> VSizeFactor             = 0 percent
> 
> WaitTime                = 0 sec
> 
> X11Parameters           = home_xauthority
> 
> Cgroup Support Configuration:
> 
> AllowedKmemSpace        = (null)
> 
> AllowedRAMSpace         = 100.0%
> 
> AllowedSwapSpace        = 1.0%
> 
> CgroupAutomount         = yes
> 
> CgroupMountpoint        = /sys/fs/cgroup
> 
> CgroupPlugin            = cgroup/v2
> 
> ConstrainCores          = yes
> 
> ConstrainDevices        = yes
> 
> ConstrainKmemSpace      = no
> 
> ConstrainRAMSpace       = yes
> 
> ConstrainSwapSpace      = yes
> 
> IgnoreSystemd           = no
> 
> IgnoreSystemdOnFailure  = no
> 
> MaxKmemPercent          = 100.0%
> 
> MaxRAMPercent           = 100.0%
> 
> MaxSwapPercent          = 100.0%
> 
> MemorySwappiness        = (null)
> 
> MinKmemSpace            = 30 MB
> 
> MinRAMSpace             = 30 MB
> 
> Slurmctld(primary) at ASlurmCluster-sched is UP
> 


