<div dir="ltr"><div>Hi,</div><div>A few minutes ago recompiled the cgroups_v2 plugin from slurm with the fix included, replaced the old cgroups_v2.{a,la,so} files with the new ones on /usr/lib/slurm and now jobs work properly on that node.</div><div>Many thanks for all the help. Indeed, in a few months we will update to the most recent 23.xx or 24.xx eventually.<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jan 24, 2024 at 1:20 PM Tim Schneider <<a href="mailto:tim.schneider1@tu-darmstadt.de">tim.schneider1@tu-darmstadt.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>
<br>
I just tested with 23.02.7-1 and the issue is gone. So it seems like the <br>
patch got released.<br>
<br>
Best,<br>
<br>
Tim<br>
<br>
On 1/24/24 16:55, Stefan Fleischmann wrote:<br>
> On Wed, 24 Jan 2024 12:37:04 -0300 Cristóbal Navarro<br>
> <<a href="mailto:cristobal.navarro.g@gmail.com" target="_blank">cristobal.navarro.g@gmail.com</a>> wrote:<br>
>> Many thanks<br>
>> One question: do we have to apply this patch (and recompile Slurm, I<br>
>> guess) only on the compute node with problems?<br>
>> Also, I noticed the patch now appears as "obsolete"; is that OK?<br>
> We have Slurm installed on an NFS share, so what I did was recompile<br>
> it and then replace only the library lib/slurm/cgroup_v2.so. Good<br>
> enough for now; I've been planning to update to 23.11 soon anyway.<br>
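><br>
> For anyone who wants to do the same, it boils down to something like<br>
> this (paths are only examples; rebuild the same Slurm version with<br>
> the same configure options as your installed one):<br>
>   # in the patched source tree<br>
>   ./configure --prefix=/usr        # match your original build options<br>
>   make<br>
>   # swap in only the rebuilt cgroup v2 plugin, then restart slurmd<br>
>   install -m 755 src/plugins/cgroup/v2/.libs/cgroup_v2.so /usr/lib/slurm/cgroup_v2.so<br>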
><br>
> I suppose it's marked as obsolete because the patch went into a<br>
> release. According to the info in the bug report it should have been<br>
> included in 23.02.4.<br>
><br>
> Cheers,<br>
> Stefan<br>
><br>
>> On Wed, Jan 24, 2024 at 9:52 AM Stefan Fleischmann <<a href="mailto:sfle@kth.se" target="_blank">sfle@kth.se</a>><br>
>> wrote:<br>
>><br>
>>> Turns out I was wrong; this is not a problem in the kernel at all.<br>
>>> It's a known bug that is triggered by long bpf logs; see here:<br>
>>> <a href="https://bugs.schedmd.com/show_bug.cgi?id=17210" rel="noreferrer" target="_blank">https://bugs.schedmd.com/show_bug.cgi?id=17210</a><br>
>>><br>
>>> There is a patch included there.<br>
>>><br>
>>> Cheers,<br>
>>> Stefan<br>
>>><br>
>>> On Tue, 23 Jan 2024 15:28:59 +0100 Stefan Fleischmann <<a href="mailto:sfle@kth.se" target="_blank">sfle@kth.se</a>><br>
>>> wrote:<br>
>>>> I don't think there is much for SchedMD to do. As I said, since it<br>
>>>> is working fine with newer kernels, there doesn't seem to be any<br>
>>>> breaking change in cgroup2 in general, only a regression<br>
>>>> introduced in one of the latest updates to 5.15.<br>
>>>><br>
>>>> If Slurm was doing something wrong with cgroup2, and it<br>
>>>> accidentally worked until this recent change, then other kernel<br>
>>>> versions should show the same behavior. But as far as I can tell<br>
>>>> it still works just fine with newer kernels.<br>
>>>><br>
>>>> Cheers,<br>
>>>> Stefan<br>
>>>><br>
>>>> On Tue, 23 Jan 2024 15:20:56 +0100<br>
>>>> Tim Schneider <<a href="mailto:tim.schneider1@tu-darmstadt.de" target="_blank">tim.schneider1@tu-darmstadt.de</a>> wrote:<br>
>>>> <br>
>>>>> Hi,<br>
>>>>><br>
>>>>> I have filed a bug report with SchedMD<br>
>>>>> (<a href="https://bugs.schedmd.com/show_bug.cgi?id=18623" rel="noreferrer" target="_blank">https://bugs.schedmd.com/show_bug.cgi?id=18623</a>), but their<br>
>>>>> support told me they cannot invest time in this issue since I<br>
>>>>> don't have a support contract. Maybe they will look into it<br>
>>>>> once it affects more people or someone important enough.<br>
>>>>><br>
>>>>> So far, I have resorted to using 5.15.0-89-generic, but I am<br>
>>>>> also a bit concerned about the security aspect of this choice.<br>
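>>>>><br>
>>>>> If it helps, staying on that kernel across updates should roughly<br>
>>>>> amount to holding the kernel meta-packages (assuming the stock<br>
>>>>> generic flavour; the names differ for other flavours):<br>
>>>>>   # stop apt from pulling in newer 5.15 kernels for now<br>
>>>>>   sudo apt-mark hold linux-generic linux-image-generic linux-headers-generic<br>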
>>>>><br>
>>>>> Best,<br>
>>>>><br>
>>>>> Tim<br>
>>>>><br>
>>>>> On 23.01.24 14:59, Stefan Fleischmann wrote:<br>
>>>>>> Hi!<br>
>>>>>><br>
>>>>>> I'm seeing the same in our environment. My conclusion is that<br>
>>>>>> it is a regression in the Ubuntu 5.15 kernel, introduced with<br>
>>>>>> 5.15.0-90-generic. The last working kernel version is<br>
>>>>>> 5.15.0-89-generic. I have filed a bug report here:<br>
>>>>>> <a href="https://bugs.launchpad.net/bugs/2050098" rel="noreferrer" target="_blank">https://bugs.launchpad.net/bugs/2050098</a><br>
>>>>>><br>
>>>>>> Please add yourself to the affected users in the bug report<br>
>>>>>> so it hopefully gets more attention.<br>
>>>>>><br>
>>>>>> I've tested with newer kernels (6.5, 6.6 and 6.7) and the<br>
>>>>>> problem does not exist there. 6.5 is the latest HWE kernel<br>
>>>>>> for 22.04 and would be an option for now. Reverting back to<br>
>>>>>> 5.15.0-89 would work as well, but I haven't looked into the<br>
>>>>>> security aspects of that.<br>
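>>>>>><br>
>>>>>> Moving to the HWE kernel should just be the usual meta-package,<br>
>>>>>> something like:<br>
>>>>>>   sudo apt install linux-generic-hwe-22.04<br>
>>>>>>   sudo reboot<br>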
>>>>>><br>
>>>>>> Cheers,<br>
>>>>>> Stefan<br>
>>>>>><br>
>>>>>> On Mon, 22 Jan 2024 13:31:15 -0300<br>
>>>>>> cristobal.navarro.g at gmail.com wrote:<br>
>>>>>> <br>
>>>>>>> Hi Tim and community,<br>
>>>>>>> We started having the same issue (cgroups not working, it<br>
>>>>>>> seems, showing all GPUs to jobs) on a GPU compute node<br>
>>>>>>> (DGX A100) a couple of days ago after a full update (apt<br>
>>>>>>> upgrade). Now whenever we launch a job for that partition,<br>
>>>>>>> we get the error message mentioned by Tim. As a note, we<br>
>>>>>>> have another custom GPU compute node with L40s, on a<br>
>>>>>>> different partition, and that one works fine. Before this<br>
>>>>>>> error, we always had small differences in kernel version<br>
>>>>>>> between nodes, so I am not sure whether this could be the<br>
>>>>>>> problem. Nevertheless, here is the info for our nodes as well.<br>
>>>>>>><br>
>>>>>>> *[Problem node]* The DGX A100 node has this kernel<br>
>>>>>>> cnavarro@nodeGPU01:~$ uname -a<br>
>>>>>>> Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15<br>
>>>>>>> 20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux<br>
>>>>>>><br>
>>>>>>> *[Functioning node]* The custom GPU node (L40s) has this kernel<br>
>>>>>>> cnavarro@nodeGPU02:~$ uname -a<br>
>>>>>>> Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14<br>
>>>>>>> 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux<br>
>>>>>>><br>
>>>>>>> *And the login node* (slurmctld)<br>
>>>>>>> $ uname -a<br>
>>>>>>> Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue<br>
>>>>>>> Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux<br>
>>>>>>><br>
>>>>>>> Any ideas what we should check?<br>
>>>>>>><br>
>>>>>>> On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider <tim.schneider1<br>
>>>>>>> at tu-darmstadt.de> wrote:<br>
>>>>>>> <br>
>>>>>>>> Hi,<br>
>>>>>>>><br>
>>>>>>>> I am using SLURM 22.05.9 on a small compute cluster. Since I<br>
>>>>>>>> reinstalled two of our nodes, I get the following error when<br>
>>>>>>>> launching a job:<br>
>>>>>>>><br>
>>>>>>>> slurmstepd: error: load_ebpf_prog: BPF load error (No space<br>
>>>>>>>> left on device). Please check your system limits (MEMLOCK).<br>
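>>>>>>>><br>
>>>>>>>> For reference, the locked-memory limit the message refers to can<br>
>>>>>>>> be checked with something like this (assuming slurmd runs under<br>
>>>>>>>> systemd):<br>
>>>>>>>>   # limit configured for the slurmd service<br>
>>>>>>>>   systemctl show slurmd -p LimitMEMLOCK<br>
>>>>>>>>   # limit of the running slurmd process<br>
>>>>>>>>   grep -i 'max locked memory' /proc/$(pidof slurmd)/limits<br>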
>>>>>>>><br>
>>>>>>>> Also, the cgroups do not seem to work properly anymore, as I<br>
>>>>>>>> am able to see all GPUs even if I do not request them,<br>
>>>>>>>> which is not the case on the other nodes.<br>
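>>>>>>>><br>
>>>>>>>> For example, something like the following lists every GPU in the<br>
>>>>>>>> node on the affected machines, while on the healthy nodes only<br>
>>>>>>>> the requested one shows up:<br>
>>>>>>>>   srun --gres=gpu:1 nvidia-smi -L<br>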
>>>>>>>><br>
>>>>>>>> One difference I found between the updated nodes and the<br>
>>>>>>>> original nodes (both are Ubuntu 22.04) is the kernel<br>
>>>>>>>> version, which is "5.15.0-89-generic #99-Ubuntu SMP" on the<br>
>>>>>>>> functioning nodes and "5.15.0-91-generic #101-Ubuntu SMP"<br>
>>>>>>>> on the updated nodes. I could not figure out how to install<br>
>>>>>>>> the exact first kernel version on the updated nodes, but I<br>
>>>>>>>> noticed that when I reinstall 5.15.0 with this tool:<br>
>>>>>>>> <a href="https://github.com/pimlie/ubuntu-mainline-kernel.sh" rel="noreferrer" target="_blank">https://github.com/pimlie/ubuntu-mainline-kernel.sh</a>, the<br>
>>>>>>>> error message disappears. However, once I do that, the<br>
>>>>>>>> network driver does not function properly anymore, so this<br>
>>>>>>>> does not seem to be a good solution.<br>
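>>>>>>>><br>
>>>>>>>> Possibly the Ubuntu archive still carries the old ABI directly,<br>
>>>>>>>> i.e. something like the following, but I have not verified this:<br>
>>>>>>>>   sudo apt install linux-image-5.15.0-89-generic \<br>
>>>>>>>>     linux-headers-5.15.0-89-generic linux-modules-extra-5.15.0-89-generic<br>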
>>>>>>>><br>
>>>>>>>> Has anyone seen this issue before or is there maybe<br>
>>>>>>>> something else I should take a look at? I am also happy to<br>
>>>>>>>> just find a workaround such that I can take these nodes<br>
>>>>>>>> back online.<br>
>>>>>>>><br>
>>>>>>>> I appreciate any help!<br>
>>>>>>>><br>
>>>>>>>> Thanks a lot in advance and best wishes,<br>
>>>>>>>><br>
>>>>>>>> Tim<br>
>>>>>>>><br>
>>>>>>>><br>
>>>>>>>> <br>
>>>> <br>
>>> <br>
>><br>
</blockquote></div><br clear="all"><br><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div>Cristóbal A. Navarro</div></div></div>