thanks for the hint.
so you also end up with two "slurmstepd infinity" processes, like I did when I tried this workaround?
[root@node ~]# ps aux | grep slurm
root      1833  0.0  0.0  33716  2188 ?      Ss   21:02   0:00 /usr/sbin/slurmstepd infinity
root      2259  0.0  0.0 236796 12108 ?      Ss   21:02   0:00 /usr/sbin/slurmd --systemd
root      2331  0.0  0.0  33716  1124 ?      S    21:02   0:00 /usr/sbin/slurmstepd infinity
root      2953  0.0  0.0 221944  1092 pts/0  S+   21:12   0:00 grep --color=auto slurm
[root@node ~]#
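One way to tell the two apart might be to check which cgroup each process lives in; nothing Slurm-specific, just a quick shell loop:

for pid in $(pgrep -f 'slurmstepd infinity'); do
    echo "== PID $pid =="
    cat /proc/$pid/cgroup
done

The cgroup paths should at least show which one was spawned by slurmd and which one came from the workaround.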
BTW, I found a mention of a change in the Slurm cgroup v2 code in the changelog for the next release:
https://github.com/SchedMD/slurm/blob/master/NEWS
One can see the commit here:
https://github.com/SchedMD/slurm/commit/c21b48e724ec6f36d82c8efb1b81b6025ede...
referring to this bug:
https://bugs.schedmd.com/show_bug.cgi?id=19157
but as the bug is private, I cannot see its description.
So perhaps with the Slurm 24.xx release we'll see something new.
cheers
josef
On 11. 04. 24 19:53, Williams, Jenny Avis wrote:
There needs to be a "slurmstepd infinity" process running before slurmd starts.
This doc goes into it: https://slurm.schedmd.com/cgroup_v2.html
There is probably a better way to do this, but this is what we do to deal with it:
::::::::::::::
files/slurm-cgrepair.service
::::::::::::::
[Unit]
Before=slurmd.service slurmctld.service
After=nas-longleaf.mount remote-fs.target system.slice
[Service]
Type=oneshot
ExecStart=/callback/slurm-cgrepair.sh
[Install]
WantedBy=default.target
::::::::::::::
files/slurm-cgrepair.sh
::::::::::::::
#!/bin/bash
# Enable the cpu, cpuset and memory controllers for children of the root
# cgroup and, if that succeeds, for children of system.slice as well.
/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/cgroup.subtree_control && \
/usr/bin/echo +cpu +cpuset +memory >> /sys/fs/cgroup/system.slice/cgroup.subtree_control
# Start the long-running slurmstepd that slurmd expects to find.
/usr/sbin/slurmstepd infinity &
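After dropping the two files in place, the unit is enabled the usual way (assuming the script lives at the ExecStart path above):

systemctl daemon-reload
systemctl enable slurm-cgrepair.service

so it runs once before slurmd on the next boot.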
*From:* Josef Dvoracek via slurm-users <slurm-users@lists.schedmd.com>
*Sent:* Thursday, April 11, 2024 11:14 AM
*To:* slurm-users@lists.schedmd.com
*Subject:* [slurm-users] Re: Slurmd enabled crash with CgroupV2
I observe the same behavior on Slurm 23.11.5 / Rocky Linux 8.9.
[root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
memory pids
[root@compute ~]# systemctl disable slurmd
Removed /etc/systemd/system/multi-user.target.wants/slurmd.service.
[root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids
[root@compute ~]# systemctl enable slurmd
Created symlink /etc/systemd/system/multi-user.target.wants/slurmd.service → /usr/lib/systemd/system/slurmd.service.
[root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids
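As a stopgap, the missing controllers can presumably be written back by hand before starting slurmd; this is just the plain cgroup v2 interface, not a Slurm-sanctioned fix:

echo +cpuset +cpu +io > /sys/fs/cgroup/cgroup.subtree_control
systemctl start slurmd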
I see this thread is ~1 year old; has a better / newer understanding of this emerged over time?
cheers
josef
On 23. 05. 23 12:46, Alan Orth wrote:
I notice the exact same behavior as Tristan. My CentOS Stream 8 system is in full unified cgroup v2 mode, slurmd.service has a "Delegate=Yes" override added to it, and all the cgroup settings are in slurm.conf and cgroup.conf, yet slurmd does not start after a reboot. I don't understand what is happening, but I see the exact same behavior regarding cgroup subtree_control when disabling / re-enabling slurmd.
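For reference, the override is a standard systemd drop-in; the exact path may differ, but something like:

# /etc/systemd/system/slurmd.service.d/override.conf
[Service]
Delegate=Yes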