[slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

Hermann Schwärzler hermann.schwaerzler at uibk.ac.at
Thu Jul 13 10:45:04 UTC 2023


Hi Jenny,

OK, I see. You are using exactly the same Slurm version and a very similar 
OS version/distribution as we do.

You have to consider that cpuset support is not available in cgroup/v2 
on kernel versions below 5.2 (see "Cgroups version 2 controllers" in 
"man cgroups" on your system). So some of the warnings/errors you see - 
at least "Controller cpuset is not enabled" - are expected (and slurmd 
should start nevertheless).
This, by the way, is one of the reasons why we stick with cgroup/v1 for 
the time being.
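You can check what your kernel actually offers: the cgroup v2 mount point exposes one file listing the controllers that are available and one listing those enabled for child cgroups (the latter is what slurmd needs). A quick read-only check, assuming the standard /sys/fs/cgroup mount point:

```shell
# Controllers the kernel provides at the cgroup v2 root:
cat /sys/fs/cgroup/cgroup.controllers 2>/dev/null || echo "no cgroup v2 mounted here"

# Controllers enabled for child cgroups:
cat /sys/fs/cgroup/cgroup.subtree_control 2>/dev/null || echo "no cgroup v2 mounted here"
```

On a kernel without cgroup/v2 cpuset support, "cpuset" will not show up in the first file no matter what you write into subtree_control.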

We did some tests with cgroup/v2, and in our case slurmd started without 
problems (apart from the error/warning regarding the cpuset controller). 
But our configuration is slightly different. You use
JobAcctGatherType       = jobacct_gather/cgroup
ProctrackType           = proctrack/cgroup
TaskPlugin              = cgroup,affinity
CgroupPlugin            = cgroup/v2

For the respective settings we use:
JobAcctGatherType       = jobacct_gather/linux
ProctrackType           = proctrack/cgroup
TaskPlugin              = task/affinity,task/cgroup
CgroupPlugin            = (null) - i.e. we don't set that one in cgroup.conf

Maybe using the same settings as we do helps in your case?
Please be aware that you should change JobAcctGatherType only when there 
are no running job steps!
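Spelled out, the suggestion amounts to something like the following fragments (a sketch based on our setup, not a drop-in replacement - everything else in your configs stays as it is):

```
# slurm.conf (changed lines only)
JobAcctGatherType=jobacct_gather/linux
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup

# cgroup.conf - leave out the CgroupPlugin line so Slurm autodetects v1/v2
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedSwapSpace=1
ConstrainDevices=yes
```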

Regards,
Hermann


On 7/12/23 16:50, Williams, Jenny Avis wrote:
> The systems have only cgroup/v2 enabled
> 	# mount |egrep cgroup
> 	cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
> Distribution and kernel
> 	RedHat 8.7
> 	4.18.0-348.2.1.el8_5.x86_64
> 
> 
> 
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Hermann Schwärzler
> Sent: Wednesday, July 12, 2023 4:36 AM
> To: slurm-users at lists.schedmd.com
> Subject: Re: [slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start
> 
> Hi Jenny,
> 
> I *guess* you have a system that has both cgroup/v1 and cgroup/v2 enabled.
> 
> Which Linux distribution are you using? And which kernel version?
> What is the output of
>     mount | grep cgroup
> What if you do not restrict the cgroup-version Slurm can use to
> cgroup/v2 but omit "CgroupPlugin=..." from your cgroup.conf?
> 
> Regards,
> Hermann
> 
> On 7/11/23 19:41, Williams, Jenny Avis wrote:
>> Additional configuration information -- /etc/slurm/cgroup.conf
>>
>> CgroupAutomount=yes
>>
>> ConstrainCores=yes
>>
>> ConstrainRAMSpace=yes
>>
>> CgroupPlugin=cgroup/v2
>>
>> AllowedSwapSpace=1
>>
>> ConstrainSwapSpace=yes
>>
>> ConstrainDevices=yes
>>
>> *From:* Williams, Jenny Avis
>> *Sent:* Tuesday, July 11, 2023 10:47 AM
>> *To:* slurm-users at schedmd.com
>> *Subject:* cgroupv2 + slurmd - external cgroup changes needed to get
>> daemon to start
>>
>> Progress on getting slurmd to start under cgroupv2
>>
>> Issue: slurmd 22.05.6 will not start when using cgroupv2
>>
>> Expected result: even after reboot slurmd will start up without
>> needing to manually add lines to /sys/fs/cgroup files.
>>
>> When started as service the error is:
>>
>> # systemctl status slurmd
>>
>> * slurmd.service - Slurm node daemon
>>
>>      Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled;
>> vendor preset: disabled)
>>
>>     Drop-In: /etc/systemd/system/slurmd.service.d
>>
>>              `-extendUnit.conf
>>
>>      Active: failed (Result: exit-code) since Tue 2023-07-11 10:29:23
>> EDT; 2s ago
>>
>>     Process: 11395 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS
>> (code=exited, status=1/FAILURE)
>>
>> Main PID: 11395 (code=exited, status=1/FAILURE)
>>
>> Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: Started Slurm node
>> daemon.
>>
>> Jul 11 10:29:23 g1803jles01.ll.unc.edu slurmd[11395]: slurmd: slurmd
>> version 22.05.6 started
>>
>> Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: slurmd.service:
>> Main process exited, code=exited, status=1/FAILURE
>>
>> Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: slurmd.service:
>> Failed with result 'exit-code'.
>>
>> When started at the command line the output is:
>>
>> # slurmd -D -vvv 2>&1 |egrep error
>>
>> slurmd: error: Controller cpuset is not enabled!
>>
>> slurmd: error: Controller cpu is not enabled!
>>
>> slurmd: error: Controller cpuset is not enabled!
>>
>> slurmd: error: Controller cpu is not enabled!
>>
>> slurmd: error: Controller cpuset is not enabled!
>>
>> slurmd: error: Controller cpu is not enabled!
>>
>> slurmd: error: Controller cpuset is not enabled!
>>
>> slurmd: error: Controller cpu is not enabled!
>>
>> slurmd: error: cpu cgroup controller is not available.
>>
>> slurmd: error: There's an issue initializing memory or cpu controller
>>
>> slurmd: error: Couldn't load specified plugin name for
>> jobacct_gather/cgroup: Plugin init() callback failed
>>
>> slurmd: error: cannot create jobacct_gather context for
>> jobacct_gather/cgroup
>>
>> Steps to mitigate the issue:
>>
>> While the following steps do not solve the issue, they do get the
>> system into a state in which slurmd will start, at least until the next
>> reboot.  Reinstalling slurm-slurmd is a one-time step to ensure that
>> local service modifications are out of the picture. Currently, even
>> after a reboot, the cgroup echo steps are necessary at a minimum.
>>
>> #!/bin/bash
>>
>> /usr/bin/dnf -y reinstall slurm-slurmd
>> systemctl daemon-reload
>> /usr/bin/pkill -f '/usr/sbin/slurmstepd infinity'
>> systemctl enable slurmd
>> systemctl stop dcismeng.service && \
>> /usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/cgroup.subtree_control && \
>> /usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/system.slice/cgroup.subtree_control && \
>> systemctl start slurmd && \
>>   echo 'run this: systemctl start dcismeng'
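One possible way to make the two echo steps survive a reboot is a small oneshot unit ordered before slurmd. This is only a sketch (the unit name and path are made up for illustration), and systemd may still rewrite subtree_control on its own, e.g. on daemon-reload:

```
# /etc/systemd/system/cgroup-controllers.service  (hypothetical name)
[Unit]
Description=Enable cpu/cpuset/memory controllers for slurmd (workaround)
Before=slurmd.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo +cpu +cpuset +memory > /sys/fs/cgroup/cgroup.subtree_control'
ExecStart=/bin/sh -c 'echo +cpu +cpuset +memory > /sys/fs/cgroup/system.slice/cgroup.subtree_control'

[Install]
WantedBy=slurmd.service
```

After `systemctl daemon-reload && systemctl enable cgroup-controllers.service`, the unit would be pulled in whenever slurmd starts.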
>>
>> Environment:
>>
>> # scontrol show config
>>
>> Configuration data as of 2023-07-11T10:39:48
>>
>> AccountingStorageBackupHost = (null)
>>
>> AccountingStorageEnforce = associations,limits,qos,safe
>>
>> AccountingStorageHost   = m1006
>>
>> AccountingStorageExternalHost = (null)
>>
>> AccountingStorageParameters = (null)
>>
>> AccountingStoragePort   = 6819
>>
>> AccountingStorageTRES   =
>> cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
>>
>> AccountingStorageType   = accounting_storage/slurmdbd
>>
>> AccountingStorageUser   = N/A
>>
>> AccountingStoreFlags    = (null)
>>
>> AcctGatherEnergyType    = acct_gather_energy/none
>>
>> AcctGatherFilesystemType = acct_gather_filesystem/none
>>
>> AcctGatherInterconnectType = acct_gather_interconnect/none
>>
>> AcctGatherNodeFreq      = 0 sec
>>
>> AcctGatherProfileType   = acct_gather_profile/none
>>
>> AllowSpecResourcesUsage = No
>>
>> AuthAltTypes            = (null)
>>
>> AuthAltParameters       = (null)
>>
>> AuthInfo                = (null)
>>
>> AuthType                = auth/munge
>>
>> BatchStartTimeout       = 10 sec
>>
>> BcastExclude            = /lib,/usr/lib,/lib64,/usr/lib64
>>
>> BcastParameters         = (null)
>>
>> BOOT_TIME               = 2023-07-11T10:04:31
>>
>> BurstBufferType         = (null)
>>
>> CliFilterPlugins        = (null)
>>
>> ClusterName             = ASlurmCluster
>>
>> CommunicationParameters = (null)
>>
>> CompleteWait            = 0 sec
>>
>> CoreSpecPlugin          = core_spec/none
>>
>> CpuFreqDef              = Unknown
>>
>> CpuFreqGovernors        = OnDemand,Performance,UserSpace
>>
>> CredType                = cred/munge
>>
>> DebugFlags              = (null)
>>
>> DefMemPerNode           = UNLIMITED
>>
>> DependencyParameters    = kill_invalid_depend
>>
>> DisableRootJobs         = No
>>
>> EioTimeout              = 60
>>
>> EnforcePartLimits       = ANY
>>
>> Epilog                  = (null)
>>
>> EpilogMsgTime           = 2000 usec
>>
>> EpilogSlurmctld         = (null)
>>
>> ExtSensorsType          = ext_sensors/none
>>
>> ExtSensorsFreq          = 0 sec
>>
>> FairShareDampeningFactor = 1
>>
>> FederationParameters    = (null)
>>
>> FirstJobId              = 1
>>
>> GetEnvTimeout           = 2 sec
>>
>> GresTypes               = gpu
>>
>> GpuFreqDef              = high,memory=high
>>
>> GroupUpdateForce        = 1
>>
>> GroupUpdateTime         = 600 sec
>>
>> HASH_VAL                = Match
>>
>> HealthCheckInterval     = 0 sec
>>
>> HealthCheckNodeState    = ANY
>>
>> HealthCheckProgram      = (null)
>>
>> InactiveLimit           = 65533 sec
>>
>> InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL
>>
>> JobAcctGatherFrequency  = task=15
>>
>> JobAcctGatherType       = jobacct_gather/cgroup
>>
>> JobAcctGatherParams     = (null)
>>
>> JobCompHost             = localhost
>>
>> JobCompLoc              = /var/log/slurm_jobcomp.log
>>
>> JobCompPort             = 0
>>
>> JobCompType             = jobcomp/none
>>
>> JobCompUser             = root
>>
>> JobContainerType        = job_container/none
>>
>> JobCredentialPrivateKey = (null)
>>
>> JobCredentialPublicCertificate = (null)
>>
>> JobDefaults             = (null)
>>
>> JobFileAppend           = 0
>>
>> JobRequeue              = 1
>>
>> JobSubmitPlugins        = lua
>>
>> KillOnBadExit           = 0
>>
>> KillWait                = 30 sec
>>
>> LaunchParameters        = (null)
>>
>> LaunchType              = launch/slurm
>>
>> Licenses                = mplus:1,nonmem:32
>>
>> LogTimeFormat           = iso8601_ms
>>
>> MailDomain              = (null)
>>
>> MailProg                = /bin/mail
>>
>> MaxArraySize            = 90001
>>
>> MaxDBDMsgs              = 701360
>>
>> MaxJobCount             = 350000
>>
>> MaxJobId                = 67043328
>>
>> MaxMemPerNode           = UNLIMITED
>>
>> MaxNodeCount            = 340
>>
>> MaxStepCount            = 40000
>>
>> MaxTasksPerNode         = 512
>>
>> MCSPlugin               = mcs/none
>>
>> MCSParameters           = (null)
>>
>> MessageTimeout          = 60 sec
>>
>> MinJobAge               = 300 sec
>>
>> MpiDefault              = none
>>
>> MpiParams               = (null)
>>
>> NEXT_JOB_ID             = 12286313
>>
>> NodeFeaturesPlugins     = (null)
>>
>> OverTimeLimit           = 0 min
>>
>> PluginDir               = /usr/lib64/slurm
>>
>> PlugStackConfig         = (null)
>>
>> PowerParameters         = (null)
>>
>> PowerPlugin             =
>>
>> PreemptMode             = OFF
>>
>> PreemptType             = preempt/none
>>
>> PreemptExemptTime       = 00:00:00
>>
>> PrEpParameters          = (null)
>>
>> PrEpPlugins             = prep/script
>>
>> PriorityParameters      = (null)
>>
>> PrioritySiteFactorParameters = (null)
>>
>> PrioritySiteFactorPlugin = (null)
>>
>> PriorityDecayHalfLife   = 14-00:00:00
>>
>> PriorityCalcPeriod      = 00:05:00
>>
>> PriorityFavorSmall      = No
>>
>> PriorityFlags           =
>> SMALL_RELATIVE_TO_TIME,CALCULATE_RUNNING,MAX_TRES
>>
>> PriorityMaxAge          = 60-00:00:00
>>
>> PriorityUsageResetPeriod = NONE
>>
>> PriorityType            = priority/multifactor
>>
>> PriorityWeightAge       = 10000
>>
>> PriorityWeightAssoc     = 0
>>
>> PriorityWeightFairShare = 10000
>>
>> PriorityWeightJobSize   = 1000
>>
>> PriorityWeightPartition = 1000
>>
>> PriorityWeightQOS       = 1000
>>
>> PriorityWeightTRES      = CPU=1000,Mem=4000,GRES/gpu=3000
>>
>> PrivateData             = none
>>
>> ProctrackType           = proctrack/cgroup
>>
>> Prolog                  = (null)
>>
>> PrologEpilogTimeout     = 65534
>>
>> PrologSlurmctld         = (null)
>>
>> PrologFlags             = Alloc,Contain,X11
>>
>> PropagatePrioProcess    = 0
>>
>> PropagateResourceLimits = ALL
>>
>> PropagateResourceLimitsExcept = (null)
>>
>> RebootProgram           = /usr/sbin/reboot
>>
>> ReconfigFlags           = (null)
>>
>> RequeueExit             = (null)
>>
>> RequeueExitHold         = (null)
>>
>> ResumeFailProgram       = (null)
>>
>> ResumeProgram           = (null)
>>
>> ResumeRate              = 300 nodes/min
>>
>> ResumeTimeout           = 60 sec
>>
>> ResvEpilog              = (null)
>>
>> ResvOverRun             = 0 min
>>
>> ResvProlog              = (null)
>>
>> ReturnToService         = 2
>>
>> RoutePlugin             = route/default
>>
>> SchedulerParameters     = batch_sched_delay=10,bf_continue,bf_max_job_part=1000,bf_max_job_test=10000,bf_max_job_user=100,bf_resolution=300,bf_window=10080,bf_yield_interval=1000000,default_queue_depth=1000,partition_job_depth=600,sched_min_interval=20000000,defer,max_rpc_cnt=80
>>
>> SchedulerTimeSlice      = 30 sec
>>
>> SchedulerType           = sched/backfill
>>
>> ScronParameters         = (null)
>>
>> SelectType              = select/cons_tres
>>
>> SelectTypeParameters    = CR_CPU_MEMORY
>>
>> SlurmUser               = slurm(47)
>>
>> SlurmctldAddr           = (null)
>>
>> SlurmctldDebug          = info
>>
>> SlurmctldHost[0]        = ASlurmCluster-sched(x.x.x.x)
>>
>> SlurmctldLogFile        = /data/slurm/slurmctld.log
>>
>> SlurmctldPort           = 6820-6824
>>
>> SlurmctldSyslogDebug    = (null)
>>
>> SlurmctldPrimaryOffProg = (null)
>>
>> SlurmctldPrimaryOnProg  = (null)
>>
>> SlurmctldTimeout        = 6000 sec
>>
>> SlurmctldParameters     = (null)
>>
>> SlurmdDebug             = info
>>
>> SlurmdLogFile           = /var/log/slurm/slurmd.log
>>
>> SlurmdParameters        = (null)
>>
>> SlurmdPidFile           = /var/run/slurmd.pid
>>
>> SlurmdPort              = 6818
>>
>> SlurmdSpoolDir          = /var/spool/slurmd
>>
>> SlurmdSyslogDebug       = (null)
>>
>> SlurmdTimeout           = 600 sec
>>
>> SlurmdUser              = root(0)
>>
>> SlurmSchedLogFile       = (null)
>>
>> SlurmSchedLogLevel      = 0
>>
>> SlurmctldPidFile        = /var/run/slurmctld.pid
>>
>> SlurmctldPlugstack      = (null)
>>
>> SLURM_CONF              = /etc/slurm/slurm.conf
>>
>> SLURM_VERSION           = 22.05.6
>>
>> SrunEpilog              = (null)
>>
>> SrunPortRange           = 0-0
>>
>> SrunProlog              = (null)
>>
>> StateSaveLocation       = /data/slurm/slurmctld
>>
>> SuspendExcNodes         = (null)
>>
>> SuspendExcParts         = (null)
>>
>> SuspendProgram          = (null)
>>
>> SuspendRate             = 60 nodes/min
>>
>> SuspendTime             = INFINITE
>>
>> SuspendTimeout          = 30 sec
>>
>> SwitchParameters        = (null)
>>
>> SwitchType              = switch/none
>>
>> TaskEpilog              = (null)
>>
>> TaskPlugin              = cgroup,affinity
>>
>> TaskPluginParam         = (null type)
>>
>> TaskProlog              = (null)
>>
>> TCPTimeout              = 2 sec
>>
>> TmpFS                   = /tmp
>>
>> TopologyParam           = (null)
>>
>> TopologyPlugin          = topology/none
>>
>> TrackWCKey              = No
>>
>> TreeWidth               = 50
>>
>> UsePam                  = No
>>
>> UnkillableStepProgram   = (null)
>>
>> UnkillableStepTimeout   = 600 sec
>>
>> VSizeFactor             = 0 percent
>>
>> WaitTime                = 0 sec
>>
>> X11Parameters           = home_xauthority
>>
>> Cgroup Support Configuration:
>>
>> AllowedKmemSpace        = (null)
>>
>> AllowedRAMSpace         = 100.0%
>>
>> AllowedSwapSpace        = 1.0%
>>
>> CgroupAutomount         = yes
>>
>> CgroupMountpoint        = /sys/fs/cgroup
>>
>> CgroupPlugin            = cgroup/v2
>>
>> ConstrainCores          = yes
>>
>> ConstrainDevices        = yes
>>
>> ConstrainKmemSpace      = no
>>
>> ConstrainRAMSpace       = yes
>>
>> ConstrainSwapSpace      = yes
>>
>> IgnoreSystemd           = no
>>
>> IgnoreSystemdOnFailure  = no
>>
>> MaxKmemPercent          = 100.0%
>>
>> MaxRAMPercent           = 100.0%
>>
>> MaxSwapPercent          = 100.0%
>>
>> MemorySwappiness        = (null)
>>
>> MinKmemSpace            = 30 MB
>>
>> MinRAMSpace             = 30 MB
>>
>> Slurmctld(primary) at ASlurmCluster-sched is UP
>>
> 
