[slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

Williams, Jenny Avis jennyw at email.unc.edu
Fri Jul 14 23:45:56 UTC 2023


Thanks, Hermann, for the feedback.

My reason for posting was to request a review of the systemd unit file for slurmd, so that this "nudging" would not be necessary.
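
To make that concrete, the kind of unit override I have in mind would be roughly the following drop-in. This is purely an illustrative sketch -- the file name is made up, and whether Delegate= on its own removes the need for the manual echoes is exactly what I would like confirmed:

    # /etc/systemd/system/slurmd.service.d/delegate.conf  (illustrative sketch)
    [Service]
    Delegate=cpu cpuset memory

    # followed by: systemctl daemon-reload && systemctl restart slurmd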

I'd like to explore that a little more -- it looks like cgroup v2 cpusets are working for us in this configuration, except for having to "nudge" the daemon to start with the steps originally listed.

This document from Red Hat explicitly describes enabling cpusets under cgroups v2 on RHEL 8 -- this, at least, appears to be working in our configuration.
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/using-cgroups-v2-to-control-distribution-of-cpu-time-for-applications_managing-monitoring-and-updating-the-kernel

This document is where I got the steps to get the daemon working and cpusets enabled.  I've checked the contents of job_*/cpuset.cpus under /s
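
For reference, these are the sorts of checks I mean (the per-job path below is an assumption on my part and may differ on other setups):

    cat /sys/fs/cgroup/cgroup.controllers                    # controllers the kernel offers at the root
    cat /sys/fs/cgroup/cgroup.subtree_control                # controllers delegated below the root
    cat /sys/fs/cgroup/system.slice/cgroup.subtree_control   # and below system.slice
    # per-job cpusets -- path is an assumption, adjust to wherever slurmd places its jobs:
    cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_*/cpuset.cpus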

Regards,
Jenny 


-----Original Message-----
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Hermann Schwärzler
Sent: Thursday, July 13, 2023 6:45 AM
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

Hi Jenny,

ok, I see. You are using the exact same Slurm version and a very similar OS version/distribution as we do.

You have to consider that cpuset support is not available in cgroup/v2 in kernel versions below 5.2 (see "Cgroups v2 controllers" in "man cgroups" on your system). So some of the warnings/errors you see - at least "Controller cpuset is not enabled" - are expected (and slurmd should start nevertheless).
This btw is one of the reasons why we stick with cgroup/v1 for the time being.
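
A quick way to check what a given node actually supports (nothing Slurm-specific, just the standard cgroup2 mount point):

    uname -r                               # running kernel version
    cat /sys/fs/cgroup/cgroup.controllers  # cgroup v2 controllers this kernel exposes

If "cpuset" does not appear in cgroup.controllers, writing +cpuset to any cgroup.subtree_control file will simply fail.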

We did some tests with cgroups/v2 and in our case slurmd started with no problems (except the error/warning regarding the cpuset controller). But we have a slightly different configuration. You use
JobAcctGatherType       = jobacct_gather/cgroup
ProctrackType           = proctrack/cgroup
TaskPlugin              = cgroup,affinity
CgroupPlugin            = cgroup/v2

For the respective settings, we use:
JobAcctGatherType       = jobacct_gather/linux
ProctrackType           = proctrack/cgroup
TaskPlugin              = task/affinity,task/cgroup
CgroupPlugin            = (null) - i.e. we don't set that one in cgroup.conf
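
In cgroup.conf that basically just leaves the constraint options, something like this (a sketch, not our literal file -- adjust the Constrain* settings to your site):

    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    ConstrainDevices=yes
    # note: no CgroupPlugin= line at all; we leave that at its default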

Maybe using the same settings as we do helps in your case?
Please be aware that you should change JobAcctGatherType only when there are no running job steps!
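
To be explicit, one way to do that change safely, sketched with standard scontrol/squeue commands (node001 is just a placeholder):

    scontrol update NodeName=node001 State=DRAIN Reason="switch JobAcctGatherType"
    squeue -w node001 -s      # wait until this lists no remaining job steps
    # change JobAcctGatherType in slurm.conf on all hosts, then restart the daemons
    systemctl restart slurmd
    scontrol update NodeName=node001 State=RESUME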

Regards,
Hermann


On 7/12/23 16:50, Williams, Jenny Avis wrote:
> The systems have only cgroup/v2 enabled
> 	# mount |egrep cgroup
> 	cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
> Distribution and kernel
> 	RedHat 8.7
> 	4.18.0-348.2.1.el8_5.x86_64
> 
> 
> 
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Hermann Schwärzler
> Sent: Wednesday, July 12, 2023 4:36 AM
> To: slurm-users at lists.schedmd.com
> Subject: Re: [slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start
> 
> Hi Jenny,
> 
> I *guess* you have a system that has both cgroup/v1 and cgroup/v2 enabled.
> 
> Which Linux distribution are you using? And which kernel version?
> What is the output of
>     mount | grep cgroup
> What if you do not restrict the cgroup-version Slurm can use to
> cgroup/v2 but omit "CgroupPlugin=..." from your cgroup.conf?
> 
> Regards,
> Hermann
> 
> On 7/11/23 19:41, Williams, Jenny Avis wrote:
>> Additional configuration information -- /etc/slurm/cgroup.conf
>>
>> CgroupAutomount=yes
>>
>> ConstrainCores=yes
>>
>> ConstrainRAMSpace=yes
>>
>> CgroupPlugin=cgroup/v2
>>
>> AllowedSwapSpace=1
>>
>> ConstrainSwapSpace=yes
>>
>> ConstrainDevices=yes
>>
>> From: Williams, Jenny Avis
>> Sent: Tuesday, July 11, 2023 10:47 AM
>> To: slurm-users at schedmd.com
>> Subject: cgroupv2 + slurmd - external cgroup changes needed to get daemon to start
>>
>> Progress on getting slurmd to start under cgroupv2
>>
>> Issue: slurmd 22.05.6 will not start when using cgroupv2
>>
>> Expected result: even after a reboot, slurmd will start up without needing to manually add lines to /sys/fs/cgroup files.
>>
>> When started as a service, the error is:
>>
>> # systemctl status slurmd
>>
>> * slurmd.service - Slurm node daemon
>>
>>      Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
>>
>>     Drop-In: /etc/systemd/system/slurmd.service.d
>>
>>              `-extendUnit.conf
>>
>>      Active: failed (Result: exit-code) since Tue 2023-07-11 10:29:23 EDT; 2s ago
>>
>>     Process: 11395 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
>>
>> Main PID: 11395 (code=exited, status=1/FAILURE)
>>
>> Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: Started Slurm node daemon.
>>
>> Jul 11 10:29:23 g1803jles01.ll.unc.edu slurmd[11395]: slurmd: slurmd version 22.05.6 started
>>
>> Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
>>
>> Jul 11 10:29:23 g1803jles01.ll.unc.edu systemd[1]: slurmd.service: Failed with result 'exit-code'.
>>
>> When started at the command line the output is:
>>
>> # slurmd -D -vvv 2>&1 |egrep error
>>
>> slurmd: error: Controller cpuset is not enabled!
>>
>> slurmd: error: Controller cpu is not enabled!
>>
>> slurmd: error: Controller cpuset is not enabled!
>>
>> slurmd: error: Controller cpu is not enabled!
>>
>> slurmd: error: Controller cpuset is not enabled!
>>
>> slurmd: error: Controller cpu is not enabled!
>>
>> slurmd: error: Controller cpuset is not enabled!
>>
>> slurmd: error: Controller cpu is not enabled!
>>
>> slurmd: error: cpu cgroup controller is not available.
>>
>> slurmd: error: There's an issue initializing memory or cpu controller
>>
>> slurmd: error: Couldn't load specified plugin name for jobacct_gather/cgroup: Plugin init() callback failed
>>
>> slurmd: error: cannot create jobacct_gather context for jobacct_gather/cgroup
>>
>> Steps to mitigate the issue:
>>
>> While the following steps do not solve the issue, they do get the
>> system into a state where slurmd will start, at least until the next
>> reboot.  Reinstalling slurm-slurmd is a one-time step to ensure that
>> local service modifications are out of the picture. Currently, even
>> after a reboot the cgroup echo steps are necessary at a minimum.
>>
>> #!/bin/bash
>>
>> /usr/bin/dnf -y reinstall slurm-slurmd
>>
>> systemctl daemon-reload
>>
>> /usr/bin/pkill -f '/usr/sbin/slurmstepd infinity'
>>
>> systemctl enable slurmd
>>
>> systemctl stop dcismeng.service && \
>>
>> /usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/cgroup.subtree_control && \
>>
>> /usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/system.slice/cgroup.subtree_control && \
>>
>> systemctl start slurmd && \
>>
>>    echo 'run this: systemctl start dcismeng'
>>
>> Environment:
>>
>> # scontrol show config
>>
>> Configuration data as of 2023-07-11T10:39:48
>>
>> AccountingStorageBackupHost = (null)
>>
>> AccountingStorageEnforce = associations,limits,qos,safe
>>
>> AccountingStorageHost   = m1006
>>
>> AccountingStorageExternalHost = (null)
>>
>> AccountingStorageParameters = (null)
>>
>> AccountingStoragePort   = 6819
>>
>> AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
>>
>> AccountingStorageType   = accounting_storage/slurmdbd
>>
>> AccountingStorageUser   = N/A
>>
>> AccountingStoreFlags    = (null)
>>
>> AcctGatherEnergyType    = acct_gather_energy/none
>>
>> AcctGatherFilesystemType = acct_gather_filesystem/none
>>
>> AcctGatherInterconnectType = acct_gather_interconnect/none
>>
>> AcctGatherNodeFreq      = 0 sec
>>
>> AcctGatherProfileType   = acct_gather_profile/none
>>
>> AllowSpecResourcesUsage = No
>>
>> AuthAltTypes            = (null)
>>
>> AuthAltParameters       = (null)
>>
>> AuthInfo                = (null)
>>
>> AuthType                = auth/munge
>>
>> BatchStartTimeout       = 10 sec
>>
>> BcastExclude            = /lib,/usr/lib,/lib64,/usr/lib64
>>
>> BcastParameters         = (null)
>>
>> BOOT_TIME               = 2023-07-11T10:04:31
>>
>> BurstBufferType         = (null)
>>
>> CliFilterPlugins        = (null)
>>
>> ClusterName             = ASlurmCluster
>>
>> CommunicationParameters = (null)
>>
>> CompleteWait            = 0 sec
>>
>> CoreSpecPlugin          = core_spec/none
>>
>> CpuFreqDef              = Unknown
>>
>> CpuFreqGovernors        = OnDemand,Performance,UserSpace
>>
>> CredType                = cred/munge
>>
>> DebugFlags              = (null)
>>
>> DefMemPerNode           = UNLIMITED
>>
>> DependencyParameters    = kill_invalid_depend
>>
>> DisableRootJobs         = No
>>
>> EioTimeout              = 60
>>
>> EnforcePartLimits       = ANY
>>
>> Epilog                  = (null)
>>
>> EpilogMsgTime           = 2000 usec
>>
>> EpilogSlurmctld         = (null)
>>
>> ExtSensorsType          = ext_sensors/none
>>
>> ExtSensorsFreq          = 0 sec
>>
>> FairShareDampeningFactor = 1
>>
>> FederationParameters    = (null)
>>
>> FirstJobId              = 1
>>
>> GetEnvTimeout           = 2 sec
>>
>> GresTypes               = gpu
>>
>> GpuFreqDef              = high,memory=high
>>
>> GroupUpdateForce        = 1
>>
>> GroupUpdateTime         = 600 sec
>>
>> HASH_VAL                = Match
>>
>> HealthCheckInterval     = 0 sec
>>
>> HealthCheckNodeState    = ANY
>>
>> HealthCheckProgram      = (null)
>>
>> InactiveLimit           = 65533 sec
>>
>> InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL
>>
>> JobAcctGatherFrequency  = task=15
>>
>> JobAcctGatherType       = jobacct_gather/cgroup
>>
>> JobAcctGatherParams     = (null)
>>
>> JobCompHost             = localhost
>>
>> JobCompLoc              = /var/log/slurm_jobcomp.log
>>
>> JobCompPort             = 0
>>
>> JobCompType             = jobcomp/none
>>
>> JobCompUser             = root
>>
>> JobContainerType        = job_container/none
>>
>> JobCredentialPrivateKey = (null)
>>
>> JobCredentialPublicCertificate = (null)
>>
>> JobDefaults             = (null)
>>
>> JobFileAppend           = 0
>>
>> JobRequeue              = 1
>>
>> JobSubmitPlugins        = lua
>>
>> KillOnBadExit           = 0
>>
>> KillWait                = 30 sec
>>
>> LaunchParameters        = (null)
>>
>> LaunchType              = launch/slurm
>>
>> Licenses                = mplus:1,nonmem:32
>>
>> LogTimeFormat           = iso8601_ms
>>
>> MailDomain              = (null)
>>
>> MailProg                = /bin/mail
>>
>> MaxArraySize            = 90001
>>
>> MaxDBDMsgs              = 701360
>>
>> MaxJobCount             = 350000
>>
>> MaxJobId                = 67043328
>>
>> MaxMemPerNode           = UNLIMITED
>>
>> MaxNodeCount            = 340
>>
>> MaxStepCount            = 40000
>>
>> MaxTasksPerNode         = 512
>>
>> MCSPlugin               = mcs/none
>>
>> MCSParameters           = (null)
>>
>> MessageTimeout          = 60 sec
>>
>> MinJobAge               = 300 sec
>>
>> MpiDefault              = none
>>
>> MpiParams               = (null)
>>
>> NEXT_JOB_ID             = 12286313
>>
>> NodeFeaturesPlugins     = (null)
>>
>> OverTimeLimit           = 0 min
>>
>> PluginDir               = /usr/lib64/slurm
>>
>> PlugStackConfig         = (null)
>>
>> PowerParameters         = (null)
>>
>> PowerPlugin             =
>>
>> PreemptMode             = OFF
>>
>> PreemptType             = preempt/none
>>
>> PreemptExemptTime       = 00:00:00
>>
>> PrEpParameters          = (null)
>>
>> PrEpPlugins             = prep/script
>>
>> PriorityParameters      = (null)
>>
>> PrioritySiteFactorParameters = (null)
>>
>> PrioritySiteFactorPlugin = (null)
>>
>> PriorityDecayHalfLife   = 14-00:00:00
>>
>> PriorityCalcPeriod      = 00:05:00
>>
>> PriorityFavorSmall      = No
>>
>> PriorityFlags           = SMALL_RELATIVE_TO_TIME,CALCULATE_RUNNING,MAX_TRES
>>
>> PriorityMaxAge          = 60-00:00:00
>>
>> PriorityUsageResetPeriod = NONE
>>
>> PriorityType            = priority/multifactor
>>
>> PriorityWeightAge       = 10000
>>
>> PriorityWeightAssoc     = 0
>>
>> PriorityWeightFairShare = 10000
>>
>> PriorityWeightJobSize   = 1000
>>
>> PriorityWeightPartition = 1000
>>
>> PriorityWeightQOS       = 1000
>>
>> PriorityWeightTRES      = CPU=1000,Mem=4000,GRES/gpu=3000
>>
>> PrivateData             = none
>>
>> ProctrackType           = proctrack/cgroup
>>
>> Prolog                  = (null)
>>
>> PrologEpilogTimeout     = 65534
>>
>> PrologSlurmctld         = (null)
>>
>> PrologFlags             = Alloc,Contain,X11
>>
>> PropagatePrioProcess    = 0
>>
>> PropagateResourceLimits = ALL
>>
>> PropagateResourceLimitsExcept = (null)
>>
>> RebootProgram           = /usr/sbin/reboot
>>
>> ReconfigFlags           = (null)
>>
>> RequeueExit             = (null)
>>
>> RequeueExitHold         = (null)
>>
>> ResumeFailProgram       = (null)
>>
>> ResumeProgram           = (null)
>>
>> ResumeRate              = 300 nodes/min
>>
>> ResumeTimeout           = 60 sec
>>
>> ResvEpilog              = (null)
>>
>> ResvOverRun             = 0 min
>>
>> ResvProlog              = (null)
>>
>> ReturnToService         = 2
>>
>> RoutePlugin             = route/default
>>
>> SchedulerParameters     = batch_sched_delay=10,bf_continue,bf_max_job_part=1000,bf_max_job_test=10000,bf_max_job_user=100,bf_resolution=300,bf_window=10080,bf_yield_interval=1000000,default_queue_depth=1000,partition_job_depth=600,sched_min_interval=20000000,defer,max_rpc_cnt=80
>>
>> SchedulerTimeSlice      = 30 sec
>>
>> SchedulerType           = sched/backfill
>>
>> ScronParameters         = (null)
>>
>> SelectType              = select/cons_tres
>>
>> SelectTypeParameters    = CR_CPU_MEMORY
>>
>> SlurmUser               = slurm(47)
>>
>> SlurmctldAddr           = (null)
>>
>> SlurmctldDebug          = info
>>
>> SlurmctldHost[0]        = ASlurmCluster-sched(x.x.x.x)
>>
>> SlurmctldLogFile        = /data/slurm/slurmctld.log
>>
>> SlurmctldPort           = 6820-6824
>>
>> SlurmctldSyslogDebug    = (null)
>>
>> SlurmctldPrimaryOffProg = (null)
>>
>> SlurmctldPrimaryOnProg  = (null)
>>
>> SlurmctldTimeout        = 6000 sec
>>
>> SlurmctldParameters     = (null)
>>
>> SlurmdDebug             = info
>>
>> SlurmdLogFile           = /var/log/slurm/slurmd.log
>>
>> SlurmdParameters        = (null)
>>
>> SlurmdPidFile           = /var/run/slurmd.pid
>>
>> SlurmdPort              = 6818
>>
>> SlurmdSpoolDir          = /var/spool/slurmd
>>
>> SlurmdSyslogDebug       = (null)
>>
>> SlurmdTimeout           = 600 sec
>>
>> SlurmdUser              = root(0)
>>
>> SlurmSchedLogFile       = (null)
>>
>> SlurmSchedLogLevel      = 0
>>
>> SlurmctldPidFile        = /var/run/slurmctld.pid
>>
>> SlurmctldPlugstack      = (null)
>>
>> SLURM_CONF              = /etc/slurm/slurm.conf
>>
>> SLURM_VERSION           = 22.05.6
>>
>> SrunEpilog              = (null)
>>
>> SrunPortRange           = 0-0
>>
>> SrunProlog              = (null)
>>
>> StateSaveLocation       = /data/slurm/slurmctld
>>
>> SuspendExcNodes         = (null)
>>
>> SuspendExcParts         = (null)
>>
>> SuspendProgram          = (null)
>>
>> SuspendRate             = 60 nodes/min
>>
>> SuspendTime             = INFINITE
>>
>> SuspendTimeout          = 30 sec
>>
>> SwitchParameters        = (null)
>>
>> SwitchType              = switch/none
>>
>> TaskEpilog              = (null)
>>
>> TaskPlugin              = cgroup,affinity
>>
>> TaskPluginParam         = (null type)
>>
>> TaskProlog              = (null)
>>
>> TCPTimeout              = 2 sec
>>
>> TmpFS                   = /tmp
>>
>> TopologyParam           = (null)
>>
>> TopologyPlugin          = topology/none
>>
>> TrackWCKey              = No
>>
>> TreeWidth               = 50
>>
>> UsePam                  = No
>>
>> UnkillableStepProgram   = (null)
>>
>> UnkillableStepTimeout   = 600 sec
>>
>> VSizeFactor             = 0 percent
>>
>> WaitTime                = 0 sec
>>
>> X11Parameters           = home_xauthority
>>
>> Cgroup Support Configuration:
>>
>> AllowedKmemSpace        = (null)
>>
>> AllowedRAMSpace         = 100.0%
>>
>> AllowedSwapSpace        = 1.0%
>>
>> CgroupAutomount         = yes
>>
>> CgroupMountpoint        = /sys/fs/cgroup
>>
>> CgroupPlugin            = cgroup/v2
>>
>> ConstrainCores          = yes
>>
>> ConstrainDevices        = yes
>>
>> ConstrainKmemSpace      = no
>>
>> ConstrainRAMSpace       = yes
>>
>> ConstrainSwapSpace      = yes
>>
>> IgnoreSystemd           = no
>>
>> IgnoreSystemdOnFailure  = no
>>
>> MaxKmemPercent          = 100.0%
>>
>> MaxRAMPercent           = 100.0%
>>
>> MaxSwapPercent          = 100.0%
>>
>> MemorySwappiness        = (null)
>>
>> MinKmemSpace            = 30 MB
>>
>> MinRAMSpace             = 30 MB
>>
>> Slurmctld(primary) at ASlurmCluster-sched is UP
>>
> 


