<div dir="ltr"><div>Hi,</div><div>A few minutes ago recompiled the cgroups_v2 plugin from slurm with the fix included, replaced the old cgroups_v2.{a,la,so} files with the new ones on /usr/lib/slurm and now jobs work properly on that node.</div><div>Many thanks for all the help. Indeed, in a few months we will update to the most recent 23.xx or 24.xx eventually.<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jan 24, 2024 at 1:20 PM Tim Schneider <<a href="mailto:tim.schneider1@tu-darmstadt.de">tim.schneider1@tu-darmstadt.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>
<br>
I just tested with 23.02.7-1 and the issue is gone. So it seems like the <br>
patch got released.<br>
<br>
Best,<br>
<br>
Tim<br>
<br>
On 1/24/24 16:55, Stefan Fleischmann wrote:<br>
> On Wed, 24 Jan 2024 12:37:04 -0300 Cristóbal Navarro<br>
> <<a href="mailto:cristobal.navarro.g@gmail.com" target="_blank">cristobal.navarro.g@gmail.com</a>> wrote:<br>
>> Many thanks<br>
>> One question: do we have to apply this patch (and recompile Slurm, I<br>
>> guess) only on the compute node with problems?<br>
>> Also, I noticed the patch now appears as "obsolete"; is that OK?<br>
> We have Slurm installed on an NFS share, so what I did was recompile<br>
> it and then replace only the library lib/slurm/cgroup_v2.so. Good<br>
> enough for now; I've been planning to update to 23.11 soon anyway.<br>
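><br>
> For anyone who wants to do the same, it boils down to something like<br>
> this (paths are only examples; rebuild the same Slurm version with<br>
> the same configure options as your installed one):<br>
>   # in the patched source tree<br>
>   ./configure --prefix=/usr        # match your original build options<br>
>   make<br>
>   # swap in only the rebuilt cgroup v2 plugin, then restart slurmd<br>
>   install -m 755 src/plugins/cgroup/v2/.libs/cgroup_v2.so /usr/lib/slurm/cgroup_v2.so<br>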
><br>
> I suppose it's marked as obsolete because the patch went into a<br>
> release. According to the info in the bug report it should have been<br>
> included in 23.02.4.<br>
><br>
> Cheers,<br>
> Stefan<br>
><br>
>> On Wed, Jan 24, 2024 at 9:52 AM Stefan Fleischmann <<a href="mailto:sfle@kth.se" target="_blank">sfle@kth.se</a>><br>
>> wrote:<br>
>><br>
>>> Turns out I was wrong; this is not a problem in the kernel at all.<br>
>>> It's a known bug that is triggered by long bpf logs; see here:<br>
>>> <a href="https://bugs.schedmd.com/show_bug.cgi?id=17210" rel="noreferrer" target="_blank">https://bugs.schedmd.com/show_bug.cgi?id=17210</a><br>
>>><br>
>>> There is a patch included there.<br>
>>><br>
>>> Cheers,<br>
>>> Stefan<br>
>>><br>
>>> On Tue, 23 Jan 2024 15:28:59 +0100 Stefan Fleischmann <<a href="mailto:sfle@kth.se" target="_blank">sfle@kth.se</a>><br>
>>> wrote:<br>
>>>> I don't think there is much for SchedMD to do. As I said, since it<br>
>>>> is working fine with newer kernels, there doesn't seem to be any<br>
>>>> breaking change in cgroup2 in general, only a regression<br>
>>>> introduced in one of the latest updates to 5.15.<br>
>>>><br>
>>>> If Slurm was doing something wrong with cgroup2, and it<br>
>>>> accidentally worked until this recent change, then other kernel<br>
>>>> versions should show the same behavior. But as far as I can tell<br>
>>>> it still works just fine with newer kernels.<br>
>>>><br>
>>>> Cheers,<br>
>>>> Stefan<br>
>>>><br>
>>>> On Tue, 23 Jan 2024 15:20:56 +0100<br>
>>>> Tim Schneider <<a href="mailto:tim.schneider1@tu-darmstadt.de" target="_blank">tim.schneider1@tu-darmstadt.de</a>> wrote:<br>
>>>> <br>
>>>>> Hi,<br>
>>>>><br>
>>>>> I have filed a bug report with SchedMD<br>
>>>>> (<a href="https://bugs.schedmd.com/show_bug.cgi?id=18623" rel="noreferrer" target="_blank">https://bugs.schedmd.com/show_bug.cgi?id=18623</a>), but their<br>
>>>>> support told me they cannot invest time in this issue since I<br>
>>>>> don't have a support contract. Maybe they will look into it<br>
>>>>> once it affects more people or someone important enough.<br>
>>>>><br>
>>>>> So far, I have resorted to using 5.15.0-89-generic, but I am<br>
>>>>> also a bit concerned about the security aspect of this choice.<br>
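>>>>><br>
>>>>> If it helps, staying on that kernel across updates should roughly<br>
>>>>> amount to holding the kernel meta-packages (assuming the stock<br>
>>>>> generic flavour; the names differ for other flavours):<br>
>>>>>   # stop apt from pulling in newer 5.15 kernels for now<br>
>>>>>   sudo apt-mark hold linux-generic linux-image-generic linux-headers-generic<br>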
>>>>><br>
>>>>> Best,<br>
>>>>><br>
>>>>> Tim<br>
>>>>><br>
>>>>> On 23.01.24 14:59, Stefan Fleischmann wrote:<br>
>>>>>> Hi!<br>
>>>>>><br>
>>>>>> I'm seeing the same in our environment. My conclusion is that<br>
>>>>>> it is a regression in the Ubuntu 5.15 kernel, introduced with<br>
>>>>>> 5.15.0-90-generic. The last working kernel version is<br>
>>>>>> 5.15.0-89-generic. I have filed a bug report here:<br>
>>>>>> <a href="https://bugs.launchpad.net/bugs/2050098" rel="noreferrer" target="_blank">https://bugs.launchpad.net/bugs/2050098</a><br>
>>>>>><br>
>>>>>> Please add yourself to the affected users in the bug report<br>
>>>>>> so it hopefully gets more attention.<br>
>>>>>><br>
>>>>>> I've tested with newer kernels (6.5, 6.6 and 6.7) and the<br>
>>>>>> problem does not exist there. 6.5 is the latest HWE kernel<br>
>>>>>> for 22.04 and would be an option for now. Reverting back to<br>
>>>>>> 5.15.0-89 would work as well, but I haven't looked into the<br>
>>>>>> security aspects of that.<br>
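>>>>>><br>
>>>>>> Moving to the HWE kernel should just be the usual meta-package,<br>
>>>>>> something like:<br>
>>>>>>   sudo apt install linux-generic-hwe-22.04<br>
>>>>>>   sudo reboot<br>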
>>>>>><br>
>>>>>> Cheers,<br>
>>>>>> Stefan<br>
>>>>>><br>
>>>>>> On Mon, 22 Jan 2024 13:31:15 -0300<br>
>>>>>> cristobal.navarro.g at gmail.com wrote:<br>
>>>>>> <br>
>>>>>>> Hi Tim and community,<br>
>>>>>>> We started having the same issue (cgroups not working, it<br>
>>>>>>> seems, showing all GPUs to jobs) on a GPU compute node<br>
>>>>>>> (DGX A100) a couple of days ago after a full update (apt<br>
>>>>>>> upgrade). Now whenever we launch a job for that partition,<br>
>>>>>>> we get the error message mentioned by Tim. As a note, we<br>
>>>>>>> have another custom GPU compute node with L40s, on a<br>
>>>>>>> different partition, and that one works fine. Before this<br>
>>>>>>> error, we always had small differences in kernel version<br>
>>>>>>> between nodes, so I am not sure whether this could be the<br>
>>>>>>> problem. Nevertheless, here is the info for our nodes as well.<br>
>>>>>>><br>
>>>>>>> *[Problem node]* The DGX A100 node has this kernel<br>
>>>>>>> cnavarro@nodeGPU01:~$ uname -a<br>
>>>>>>> Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15<br>
>>>>>>> 20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux<br>
>>>>>>><br>
>>>>>>> *[Functioning node]* The custom GPU node (L40s) has this kernel<br>
>>>>>>> cnavarro@nodeGPU02:~$ uname -a<br>
>>>>>>> Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14<br>
>>>>>>> 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux<br>
>>>>>>><br>
>>>>>>> *And the login node* (slurmctld)<br>
>>>>>>> $ uname -a<br>
>>>>>>> Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue<br>
>>>>>>> Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux<br>
>>>>>>><br>
>>>>>>> Any ideas what we should check?<br>
>>>>>>><br>
>>>>>>> On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider <tim.schneider1<br>
>>>>>>> at tu-darmstadt.de> wrote:<br>
>>>>>>> <br>
>>>>>>>> Hi,<br>
>>>>>>>><br>
>>>>>>>> I am using SLURM 22.05.9 on a small compute cluster. Since I<br>
>>>>>>>> reinstalled two of our nodes, I get the following error when<br>
>>>>>>>> launching a job:<br>
>>>>>>>><br>
>>>>>>>> slurmstepd: error: load_ebpf_prog: BPF load error (No space<br>
>>>>>>>> left on device). Please check your system limits (MEMLOCK).<br>
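>>>>>>>><br>
>>>>>>>> For reference, the locked-memory limit the message refers to can<br>
>>>>>>>> be checked with something like this (assuming slurmd runs under<br>
>>>>>>>> systemd):<br>
>>>>>>>>   # limit configured for the slurmd service<br>
>>>>>>>>   systemctl show slurmd -p LimitMEMLOCK<br>
>>>>>>>>   # limit of the running slurmd process<br>
>>>>>>>>   grep -i 'max locked memory' /proc/$(pidof slurmd)/limits<br>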
>>>>>>>><br>
>>>>>>>> Also, the cgroups do not seem to work properly anymore, as I<br>
>>>>>>>> am able to see all GPUs even if I do not request them,<br>
>>>>>>>> which is not the case on the other nodes.<br>
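>>>>>>>><br>
>>>>>>>> For example, something like the following lists every GPU in the<br>
>>>>>>>> node on the affected machines, while on the healthy nodes only<br>
>>>>>>>> the requested one shows up:<br>
>>>>>>>>   srun --gres=gpu:1 nvidia-smi -L<br>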
>>>>>>>><br>
>>>>>>>> One difference I found between the updated nodes and the<br>
>>>>>>>> original nodes (both are Ubuntu 22.04) is the kernel<br>
>>>>>>>> version, which is "5.15.0-89-generic #99-Ubuntu SMP" on the<br>
>>>>>>>> functioning nodes and "5.15.0-91-generic #101-Ubuntu SMP"<br>
>>>>>>>> on the updated nodes. I could not figure out how to install<br>
>>>>>>>> the exact first kernel version on the updated nodes, but I<br>
>>>>>>>> noticed that when I reinstall 5.15.0 with this tool:<br>
>>>>>>>> <a href="https://github.com/pimlie/ubuntu-mainline-kernel.sh" rel="noreferrer" target="_blank">https://github.com/pimlie/ubuntu-mainline-kernel.sh</a>, the<br>
>>>>>>>> error message disappears. However, once I do that, the<br>
>>>>>>>> network driver does not function properly anymore, so this<br>
>>>>>>>> does not seem to be a good solution.<br>
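>>>>>>>><br>
>>>>>>>> Possibly the Ubuntu archive still carries the old ABI directly,<br>
>>>>>>>> i.e. something like the following, but I have not verified this:<br>
>>>>>>>>   sudo apt install linux-image-5.15.0-89-generic \<br>
>>>>>>>>     linux-headers-5.15.0-89-generic linux-modules-extra-5.15.0-89-generic<br>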
>>>>>>>><br>
>>>>>>>> Has anyone seen this issue before or is there maybe<br>
>>>>>>>> something else I should take a look at? I am also happy to<br>
>>>>>>>> just find a workaround such that I can take these nodes<br>
>>>>>>>> back online.<br>
>>>>>>>><br>
>>>>>>>> I appreciate any help!<br>
>>>>>>>><br>
>>>>>>>> Thanks a lot in advance and best wishes,<br>
>>>>>>>><br>
>>>>>>>> Tim<br>
>>>>>>>><br>
>>>>>>>><br>
>>>>>>>> <br>
>>>> <br>
>>> <br>
>><br>
</blockquote></div><br clear="all"><br><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div>Cristóbal A. Navarro</div></div></div>