The end goal is to see the following two things:

jobs under the slurmstepd cgroup path, and

at least cpu, cpuset and memory listed in the cgroup.controllers file of each job's cgroup.
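
A quick way to check the second point is a small shell helper. This is only a sketch: has_controllers is a hypothetical name (not a Slurm tool), and you pass it the path to whichever cgroup.controllers file you want to inspect.

```shell
# Hypothetical helper (not part of Slurm): succeed only if every named
# controller is listed in the given cgroup.controllers file.
has_controllers() {
    local file="$1"; shift
    local have want
    # Pad with spaces so "cpu" cannot match inside "cpuset".
    have=" $(tr '\n' ' ' < "$file") " || return 2
    for want in "$@"; do
        case "$have" in
            *" $want "*) ;;                               # present
            *) echo "missing: $want" >&2; return 1 ;;     # absent
        esac
    done
}
```

For example, has_controllers /sys/fs/cgroup/system.slice/slurmstepd.scope/job_35400562/step_extern/slurm/cgroup.controllers cpu cpuset memory should return 0 on a node where the job cgroup is set up as described.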

The pattern you have would be the processes left after boot: the first, failed slurmd service start leaves a slurmstepd infinity process behind, and then the second slurmd attempt starts. In your case there is a second slurmstepd infinity process. As to why those specifics occur, I can’t answer that from here without poking at it more.

Having that slurmstepd infinity running with the cgroup controllers needed (for us, at a minimum, cpuset, cpu and memory; YMMV depending on the cgroup.conf settings) before slurmd tries to start is what enables slurmd to start.

The necessary piece for this to work is that the required controllers are available at the parent of the path before slurmd, and in particular slurmstepd infinity, starts.

Our cgroup.conf file is:

CgroupAutomount=yes

ConstrainCores=yes

ConstrainRAMSpace=yes

CgroupPlugin=cgroup/v2

AllowedSwapSpace=1

ConstrainDevices=yes

ConstrainSwapSpace=yes

So the missing piece needed to get slurmd to start at boot is supplied by applying these changes to the cgroup controllers before the slurmd service attempts to start. As a test, on your system as it is now, without adding anything I’ve mentioned, try a cgroup.conf with zero Constrain statements. My bet is that slurmd then starts cleanly on boot. I hope that the bug fix does not make slurmd more liberal about checking the cgroup controller list. It took a while before I trusted that the controllers were actually there, so knowing that the controllers are there whenever slurmd starts is great.

/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/cgroup.subtree_control && \

/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/system.slice/cgroup.subtree_control
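
If you want to enable only what is missing rather than writing all three every time, something like the following may help. It is a sketch under assumptions: subtree_request is a hypothetical helper, and it only builds the "+ctrl" string; the actual write is left to you.

```shell
# Hypothetical helper: print "+ctrl" for each requested controller that is
# not yet listed in DIR/cgroup.subtree_control (prints an empty line if all
# of them are already enabled).
subtree_request() {
    local dir="$1"; shift
    local current ctrl request=""
    current=" $(tr '\n' ' ' < "$dir/cgroup.subtree_control") " || return 2
    for ctrl in "$@"; do
        case "$current" in
            *" $ctrl "*) ;;                                   # already enabled
            *) request="${request:+$request }+$ctrl" ;;       # still missing
        esac
    done
    printf '%s\n' "$request"
}

# Intended use on a real node (as root), shown as comments so that nothing
# is written when this snippet is merely sourced:
#   req=$(subtree_request /sys/fs/cgroup cpu cpuset memory)
#   [ -n "$req" ] && echo "$req" > /sys/fs/cgroup/cgroup.subtree_control
```

With a subtree_control that contains only "memory pids", the helper prints "+cpu +cpuset".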

The job cgroup propagation (the contents of the cgroup.controllers files along the cgroup path) after slurmd + slurmstepd infinity start goes via the cgroup path established under slurmstepd.scope. If there is no slurmstepd infinity, slurm will start one; if slurmstepd infinity is running and it sets up at minimum the cgroups slurmd needs based on what is in cgroup.conf, then slurmd doesn’t end up starting more slurmstepd infinity processes. My recollection is that the first slurmstepd infinity does set up the needed cgroup controllers, which is why a second slurmd attempt then starts.

To see what slurmd complains about specifically, try disabling the slurmd service, rebooting, setting SLURM_DEBUG_FLAGS = cgroups, and then running slurmd -D -vvv manually. I am fairly sure that helps show the particulars better.

In theory, in our setup the slurm-cgrepair.service forces a slurmstepd infinity process to be running prior to the slurmd service finishing its start (IDK, the PID order says otherwise).

# systemctl show slurmd |egrep cgrepair

After=network-online.target systemd-journald.socket slurm-cgrepair.service remote-fs.target system.slice sysinit.target nvidia.service munge.service basic.target

The resulting behavior of this setup is as we expect – the slurmd service is running on nodes after reboot without intervention. Our steps may not be all necessary, but they are sufficient.

The list of cgroup controllers for processes further down the cgroup path (cpu, cpuset, memory for slurmstepd.scope/job_xxxx) can only be a subset of those of any parent in the cgroup path (cpuset, cpu, memory, pids for slurmstepd.scope).
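
That subset relationship can be checked between any parent and child directory on the path. A minimal sketch, assuming a hypothetical helper named controllers_subset that takes the child and parent cgroup directories:

```shell
# Hypothetical helper: succeed if every controller listed in
# CHILD/cgroup.controllers also appears in PARENT/cgroup.controllers.
controllers_subset() {
    local child="$1" parent="$2"
    local have ctrl
    [ -r "$child/cgroup.controllers" ] || return 2
    have=" $(tr '\n' ' ' < "$parent/cgroup.controllers") " || return 2
    for ctrl in $(cat "$child/cgroup.controllers"); do
        case "$have" in
            *" $ctrl "*) ;;   # parent has it too
            *) echo "$ctrl is in $child but not in $parent" >&2; return 1 ;;
        esac
    done
}
```

On our nodes I would expect controllers_subset to succeed with the job's step_extern/slurm directory as the child and /sys/fs/cgroup/system.slice/slurmstepd.scope as the parent.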

You asked what our process tree looks like; here is that information. I add the cgroup field in top for ongoing assurance that user processes are under the slurmstepd.scope path.

This is the process tree on our nodes.

# ps aux |grep slurm |head -n 15 |sed 's/xxxx/aUser/g'

root 8687 0.0 0.0 6471088 34044 ? Ss Apr03 0:29 /usr/sbin/slurmd -D -s

root 8694 0.0 0.0 33668 1080 ? S Apr03 0:00 /usr/sbin/slurmstepd infinity

root 2942928 0.0 0.0 311804 7416 ? Sl Apr06 0:42 slurmstepd: [35400562.extern]

root 2942930 0.0 0.0 311804 7164 ? Sl Apr06 0:43 slurmstepd: [35400563.extern]

root 2942933 0.0 0.0 311804 7144 ? Sl Apr06 0:45 slurmstepd: [35400564.extern]

root 2942935 0.0 0.0 311804 7280 ? Sl Apr06 0:38 slurmstepd: [35400565.extern]

root 2942953 0.0 0.0 312164 7496 ? Sl Apr06 0:45 slurmstepd: [35400564.batch]

root 2942958 0.0 0.0 312164 7620 ? Sl Apr06 0:41 slurmstepd: [35400562.batch]

root 2942960 0.0 0.0 312164 7636 ? Sl Apr06 0:43 slurmstepd: [35400563.batch]

root 2942962 0.0 0.0 312164 7728 ? Sl Apr06 0:41 slurmstepd: [35400565.batch]

aUser 2942972 0.0 0.0 12868 3072 ? SN Apr06 0:00 /bin/bash /var/spool/slurmd/job35400562/slurm_script

aUser 2942973 0.0 0.0 12868 2868 ? SN Apr06 0:00 /bin/bash /var/spool/slurmd/job35400564/slurm_script

aUser 2942974 0.0 0.0 12868 3000 ? SN Apr06 0:00 /bin/bash /var/spool/slurmd/job35400565/slurm_script

aUser 2942975 0.0 0.0 12868 2980 ? SN Apr06 0:00 /bin/bash /var/spool/slurmd/job35400563/slurm_script

root 2944250 0.0 0.0 311804 7248 ? Sl Apr06 0:44 slurmstepd: [35400838.extern]

# pgrep slurm |head -n4 |xargs |sed 's/ /,/g' |xargs -L1 -I{} top -bn1 -Hi -p {}

top - 16:24:16 up 7 days, 21:09, 2 users, load average: 48.07, 47.39, 46.90

Threads: 10 total, 0 running, 10 sleeping, 0 stopped, 0 zombie

%Cpu(s): 76.6 us, 1.1 sy, 1.8 ni, 12.1 id, 7.6 wa, 0.4 hi, 0.4 si, 0.0 st

KiB Mem : 39548588+total, 48296324 free, 28092556 used, 31909702+buff/cache

KiB Swap: 2097148 total, 1354112 free, 743036 used. 32913958+avail Mem

PID USER RES SHR S %CPU %MEM TIME+ P CGROUPS COMMAND

8687 root 34044 7664 S 0.0 0.0 0:04.28 8 0::/system.slice/slurmd.service slurmd

8694 root 1080 936 S 0.0 0.0 0:00.00 46 0::/system.slice/slurmstepd.scope/system slurmstepd

2942928 root 7412 6488 S 0.0 0.0 0:00.01 24 0::/system.slice/slurmstepd.scope/job_35400562/step_extern/slurm slurmstepd

2942936 root 7412 6488 S 0.0 0.0 0:34.29 24 0::/system.slice/slurmstepd.scope/job_35400562/step_extern/slurm `- acctg

2942937 root 7412 6488 S 0.0 0.0 0:08.27 24 0::/system.slice/slurmstepd.scope/job_35400562/step_extern/slurm `- acctg_prof

2942938 root 7412 6488 S 0.0 0.0 0:00.05 24 0::/system.slice/slurmstepd.scope/job_35400562/step_extern/slurm `- slurmstepd

2942930 root 7164 6236 S 0.0 0.0 0:00.01 28 0::/system.slice/slurmstepd.scope/job_35400563/step_extern/slurm slurmstepd

2942939 root 7164 6236 S 0.0 0.0 0:36.40 28 0::/system.slice/slurmstepd.scope/job_35400563/step_extern/slurm `- acctg

2942940 root 7164 6236 S 0.0 0.0 0:07.10 28 0::/system.slice/slurmstepd.scope/job_35400563/step_extern/slurm `- acctg_prof

2942941 root 7164 6236 S 0.0 0.0 0:00.04 28 0::/system.slice/slurmstepd.scope/job_35400563/step_extern/slurm `- slurmstepd
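
The same check can be done without top by reading /proc/<pid>/cgroup, which on cgroup v2 holds a single line of the form 0::/path. A sketch with two hypothetical helpers (the names are mine, not Slurm's):

```shell
# Hypothetical helper: print the cgroup v2 path from a /proc/<pid>/cgroup
# style file, i.e. the line that starts with "0::".
cgroup_v2_path() {
    sed -n 's/^0:://p' "$1"
}

# Hypothetical helper: succeed if that path sits under slurmstepd.scope.
under_slurmstepd() {
    case "$(cgroup_v2_path "$1")" in
        */slurmstepd.scope/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, under_slurmstepd /proc/2942972/cgroup should succeed for the user's batch script above, while the same call for the slurmd PID should fail, since slurmd itself lives in slurmd.service.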

# sacct -j 35400562 -p

35400547_10|run_s111.sh|general|slurm_account|1|RUNNING|0:0|

35400547_10.batch|batch||slurm_account|1|RUNNING|0:0|

35400547_10.extern|extern||slurm_account|1|RUNNING|0:0|

# scontrol listpids 35400562

PID JOBID STEPID LOCALID GLOBALID

-1 35400562 extern 0 0

2942928 35400562 extern - -

2942946 35400562 extern - -

2942972 35400562 batch 0 0

2942958 35400562 batch - -

2943039 35400562 batch - -

# cat /sys/fs/cgroup/system.slice/slurmstepd.scope/cgroup.controllers

cpuset cpu memory pids

# cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_35400562/step_extern/slurm/cgroup.controllers

cpuset cpu memory

From: Josef Dvoracek via slurm-users <slurm-users@lists.schedmd.com>
Sent: Thursday, April 11, 2024 3:28 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: Slurmd enabled crash with CgroupV2

Thanks for the hint.

So you end up with two "slurmstepd infinity" processes, like me when I tried this workaround?

[root@node ~]# ps aux | grep slurm
root        1833 0.0 0.0 33716 2188 ?        Ss   21:02   0:00 /usr/sbin/slurmstepd infinity
root        2259 0.0 0.0 236796 12108 ?        Ss   21:02   0:00 /usr/sbin/slurmd --systemd
root        2331 0.0 0.0 33716 1124 ?        S    21:02   0:00 /usr/sbin/slurmstepd infinity
root        2953 0.0 0.0 221944 1092 pts/0    S+   21:12   0:00 grep --color=auto slurm
[root@node ~]#

BTW, I found a mention of a change in the Slurm cgroup v2 code in the changelog for the next Slurm release:

https://github.com/SchedMD/slurm/blob/master/NEWS

one can see here the commit

https://github.com/SchedMD/slurm/commit/c21b48e724ec6f36d82c8efb1b81b6025ede240d

referring to bug

https://bugs.schedmd.com/show_bug.cgi?id=19157

but as the bug is private, I can not see the bug description.

So perhaps with Slurm 24.xx release we'll see something new.

cheers

josef

On 11. 04. 24 19:53, Williams, Jenny Avis wrote:

There needs to be a slurmstepd infinity process running before slurmd starts.

This doc goes into it:
https://slurm.schedmd.com/cgroup_v2.html

There is probably a better way to do this, but this is what we do to deal with that:

::::::::::::::

files/slurm-cgrepair.service

::::::::::::::

[Unit]

Before=slurmd.service slurmctld.service

After=nas-longleaf.mount remote-fs.target system.slice

[Service]

Type=oneshot

ExecStart=/callback/slurm-cgrepair.sh

[Install]

WantedBy=default.target

::::::::::::::

files/slurm-cgrepair.sh

::::::::::::::

#!/bin/bash

/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/cgroup.subtree_control && \

/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/system.slice/cgroup.subtree_control

/usr/sbin/slurmstepd infinity &

From: Josef Dvoracek via slurm-users <slurm-users@lists.schedmd.com>
Sent: Thursday, April 11, 2024 11:14 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: Slurmd enabled crash with CgroupV2

I observe the same behavior on Slurm 23.11.5, Rocky Linux 8.9.

> [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
> memory pids
> [root@compute ~]# systemctl disable slurmd
> Removed /etc/systemd/system/multi-user.target.wants/slurmd.service.
> [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
> cpuset cpu io memory pids
> [root@compute ~]# systemctl enable slurmd
> Created symlink /etc/systemd/system/multi-user.target.wants/slurmd.service → /usr/lib/systemd/system/slurmd.service.
> [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
> cpuset cpu io memory pids

Over time (I see this thread is ~1 year old), is there a better / new understanding of this?

cheers

josef

On 23. 05. 23 12:46, Alan Orth wrote:

I notice the exact same behavior as Tristan. My CentOS Stream 8 system is in full unified cgroupv2 mode, the slurmd.service has a "Delegate=Yes" override added to it, and all cgroup stuff is added to slurm.conf and cgroup.conf, yet slurmd does not start after reboot. I don't understand what is happening, but I see the exact same behavior regarding the cgroup subtree_control with disabling / re-enabling slurmd.