<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div class="elementToProof"><span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">See my comments on
</span><span style="letter-spacing: normal; font-family: 'Segoe UI Web (West European)', 'Segoe UI', -apple-system, BlinkMacSystemFont, Roboto, 'Helvetica Neue', sans-serif; font-size: 14.6667px; font-weight: 400; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);"><a href="https://bugs.launchpad.net/bugs/2050098" id="OWA2e97ba66-eb70-a5a3-c8ab-e8153c6a36e9" class="OWAAutoLink" data-auth="NotApplicable" data-loopstyle="linkonly" style="margin: 0px; text-align: left; background-color: rgb(255, 255, 255);">https://bugs.launchpad.net/bugs/2050098</a></span><span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">.
There's a pretty simple fix in slurm.</span></div>
<div class="elementToProof"><span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);"><br>
</span></div>
<div class="elementToProof"><span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">As far as I can tell, there's nothing wrong with the slurm code. But it's using an option
that it doesn't actually need, and that seems to be causing trouble in the kernel.</span></div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
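<div class="elementToProof"><span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">If you want to check which nodes are likely affected, here is a rough Python sketch. It assumes what Stefan reports below: the regression first appears in 5.15.0-90-generic, 5.15.0-89-generic is the last working build, and 6.5 or newer is fine. The function name classify_kernel and its three labels are made up for illustration, and anything outside those cases (for example the -nvidia kernels, which are numbered differently) is reported as unknown rather than guessed.</span></div>
<pre>
```python
import platform
import re

# Assumption taken from the thread: the regression first appeared in the
# Ubuntu build 5.15.0-90-generic; 5.15.0-89-generic is the last known-good one.
FIRST_BAD_ABI = 90


def classify_kernel(release: str) -> str:
    """Return 'suspect', 'ok', or 'unknown' for a `uname -r` style string."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)-(\w+)", release)
    if match is None:
        return "unknown"
    major, minor, _patch, abi = (int(g) for g in match.groups()[:4])
    flavour = match.group(5)
    if (major, minor) != (5, 15):
        # 6.5, 6.6 and 6.7 were reported as working; other series were not tested.
        return "ok" if (major, minor) >= (6, 5) else "unknown"
    if flavour != "generic":
        # e.g. 5.15.0-1042-nvidia uses a different ABI numbering.
        return "unknown"
    return "suspect" if abi >= FIRST_BAD_ABI else "ok"


if __name__ == "__main__":
    release = platform.release()
    print(release, classify_kernel(release))
```
</pre>
<div class="elementToProof"><span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">Run it on a node to classify the running kernel, or pass classify_kernel the output of uname -r collected from each node.</span></div>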
<div id="appendonsend"></div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<hr style="display: inline-block; width: 98%;">
<div id="divRplyFwdMsg" dir="ltr"><span style="font-family: Calibri, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);"><b>From:</b> slurm-users &lt;slurm-users-bounces@lists.schedmd.com&gt; on behalf of Tim Schneider &lt;tim.schneider1@tu-darmstadt.de&gt;<br>
<b>Sent:</b> Tuesday, January 23, 2024 9:20 AM<br>
<b>To:</b> Stefan Fleischmann &lt;sfle@kth.se&gt;; slurm-users@lists.schedmd.com &lt;slurm-users@lists.schedmd.com&gt;<br>
<b>Subject:</b> Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).</span>
<div> </div>
</div>
<div><span style="font-size: 11pt;">Hi,<br>
<br>
I have filed a bug report with SchedMD<br>
(<a href="https://bugs.schedmd.com/show_bug.cgi?id=18623" id="OWAdaa7cc38-f0b4-6921-f99e-9657e3ce0599" class="OWAAutoLink" data-auth="NotApplicable" data-loopstyle="linkonly">https://bugs.schedmd.com/show_bug.cgi?id=18623</a>), but their support told<br>
me they cannot invest time in this issue since I don't have a support<br>
contract. Maybe they will look into it once it affects more people or<br>
someone important enough.<br>
<br>
So far, I have resorted to using 5.15.0-89-generic, but I am also a bit<br>
concerned about the security aspect of this choice.<br>
<br>
Best,<br>
<br>
Tim<br>
<br>
On 23.01.24 14:59, Stefan Fleischmann wrote:<br>
> Hi!<br>
><br>
> I'm seeing the same in our environment. My conclusion is that it is a<br>
> regression in the Ubuntu 5.15 kernel, introduced with 5.15.0-90-generic.<br>
> Last working kernel version is 5.15.0-89-generic. I have filed a bug<br>
> report here: <a href="https://bugs.launchpad.net/bugs/2050098" id="OWA5c0148a6-3256-ede0-8246-51d35863a272" class="OWAAutoLink" data-auth="NotApplicable" data-loopstyle="linkonly">
https://bugs.launchpad.net/bugs/2050098</a><br>
><br>
> Please add yourself to the affected users in the bug report so it<br>
> hopefully gets more attention.<br>
><br>
> I've tested with newer kernels (6.5, 6.6 and 6.7) and the problem does<br>
> not exist there. 6.5 is the latest hwe kernel for 22.04 and would be an<br>
> option for now. Reverting back to 5.15.0-89 would work as well, but I<br>
> haven't looked into the security aspects of that.<br>
><br>
> Cheers,<br>
> Stefan<br>
><br>
> On Mon, 22 Jan 2024 13:31:15 -0300<br>
> cristobal.navarro.g at gmail.com wrote:<br>
><br>
>> Hi Tim and community,<br>
>> We have been having the same issue (cgroups do not seem to work, and jobs<br>
>> see all GPUs) on a GPU compute node (DGX A100) since a full update (apt<br>
>> upgrade) a couple of days ago. Now whenever we launch<br>
>> a job for that partition, we get the error message mentioned by Tim.<br>
>> As a note, we have another custom GPU-compute node with L40s, on a<br>
>> different partition, and that one works fine.<br>
>> Before this error, we always had small differences in kernel version<br>
>> between nodes, so I am not sure if this can be the problem.<br>
>> Nevertheless, here is the info of our nodes as well.<br>
>><br>
>> *[Problem node]* The DGX A100 node has this kernel<br>
>> cnavarro at nodeGPU01:~$ uname -a<br>
>> Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30<br>
>> UTC 2023 x86_64 x86_64 x86_64 GNU/Linux<br>
>><br>
>> *[Functioning node]* The Custom GPU node (L40s) has this kernel<br>
>> cnavarro at nodeGPU02:~$ uname -a<br>
>> Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08<br>
>> UTC 2023 x86_64 x86_64 x86_64 GNU/Linux<br>
>><br>
>> *And the login node* (slurmctld)<br>
>> ~$ uname -a<br>
>> Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14<br>
>> 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux<br>
>><br>
>> Any ideas what we should check?<br>
>><br>
>> On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider &lt;tim.schneider1 at<br>
>> tu-darmstadt.de&gt; wrote:<br>
>><br>
>>> Hi,<br>
>>><br>
>>> I am using SLURM 22.05.9 on a small compute cluster. Since I<br>
>>> reinstalled two of our nodes, I get the following error when<br>
>>> launching a job:<br>
>>><br>
>>> slurmstepd: error: load_ebpf_prog: BPF load error (No space left on<br>
>>> device). Please check your system limits (MEMLOCK).<br>
>>><br>
>>> Also the cgroups do not seem to work properly anymore, as I am able<br>
>>> to see all GPUs even if I do not request them, which is not the<br>
>>> case on the other nodes.<br>
>>><br>
>>> One difference I found between the updated nodes and the original<br>
>>> nodes (both are Ubuntu 22.04) is the kernel version, which is<br>
>>> "5.15.0-89-generic #99-Ubuntu SMP" on the functioning nodes and<br>
>>> "5.15.0-91-generic #101-Ubuntu SMP" on the updated nodes. I could<br>
>>> not figure out how to install the exact first kernel version on the<br>
>>> updated nodes, but I noticed that when I reinstall 5.15.0 with this<br>
>>> tool: <a href="https://github.com/pimlie/ubuntu-mainline-kernel.sh" id="OWAc46c1f20-502a-ddad-6828-83ec1e328e00" class="OWAAutoLink" data-auth="NotApplicable" data-loopstyle="linkonly">
https://github.com/pimlie/ubuntu-mainline-kernel.sh</a>, the<br>
>>> error message disappears. However, once I do that, the network<br>
>>> driver does not function properly anymore, so this does not seem to<br>
>>> be a good solution.<br>
>>><br>
>>> Has anyone seen this issue before or is there maybe something else I<br>
>>> should take a look at? I am also happy to just find a workaround<br>
>>> such that I can take these nodes back online.<br>
>>><br>
>>> I appreciate any help!<br>
>>><br>
>>> Thanks a lot in advance and best wishes,<br>
>>><br>
>>> Tim<br>
>>><br>
>>><br>
>>> <br>
<br>
</span></div>
</body>
</html>