See my comments on https://bugs.launchpad.net/bugs/2050098. There's a pretty simple fix in slurm.

As far as I can tell, there's nothing wrong with the slurm code. But it's using an option that it doesn't actually need, and that seems to be causing trouble in the kernel.
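
Since the error text points at MEMLOCK, it may be worth ruling the limit out before blaming the kernel. The following is a minimal Python sketch, not something taken from the bug report. The helper name `memlock_limits` is made up for illustration, and it reports the limit of the process that runs it, which is not necessarily the limit slurmstepd runs under.

```python
import resource


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK of the current process as readable strings.

    Assumption: this is the limit of whatever process runs this function,
    which can differ from the limit applied to slurmd/slurmstepd.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def describe(value):
        # RLIM_INFINITY is the sentinel the resource module uses for "no limit".
        return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"

    return describe(soft), describe(hard)


soft, hard = memlock_limits()
print(f"MEMLOCK soft limit: {soft}, hard limit: {hard}")
```

If the limit already shows as unlimited in the daemon's environment and the error persists, that would be consistent with the kernel-side explanation discussed in this thread.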



From: slurm-users <slurm-users-bounces@lists.schedmd.com> on behalf of Tim Schneider <tim.schneider1@tu-darmstadt.de>
Sent: Tuesday, January 23, 2024 9:20 AM
To: Stefan Fleischmann <sfle@kth.se>; slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).
 
Hi,

I have filed a bug report with SchedMD
(https://bugs.schedmd.com/show_bug.cgi?id=18623), but their support told
me they cannot invest time in this issue since I don't have a support
contract. Maybe they will look into it once it affects more people or
someone important enough.

So far, I have resorted to using 5.15.0-89-generic, but I am also a bit
concerned about the security aspect of this choice.
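
To see at a glance which nodes might be hit, a rough Python sketch along these lines could help. It assumes, based only on the reports in this thread, that Ubuntu 5.15.0 generic builds from ABI number 90 onwards are the suspect ones and that 5.15.0-89 is fine. The function names and the threshold constant are hypothetical, and other flavours (such as the -nvidia kernels) use a different numbering, so the sketch gives no verdict for them.

```python
import re

# Assumption from this thread, not from an official advisory:
# 5.15.0-89-generic works, 5.15.0-90-generic and later were reported to fail.
FIRST_REPORTED_BAD_ABI = 90


def parse_release(release):
    """Split e.g. '5.15.0-91-generic' into ((5, 15, 0), 91, 'generic')."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)-(\S+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch, abi, flavour = match.groups()
    return (int(major), int(minor), int(patch)), int(abi), flavour


def possibly_affected(release):
    """Return True/False for 5.15.0 generic kernels, None when the thread gives no basis."""
    version, abi, flavour = parse_release(release)
    if version != (5, 15, 0) or flavour != "generic":
        return None  # different series or flavour: ABI numbers are not comparable
    return abi >= FIRST_REPORTED_BAD_ABI


for release in ("5.15.0-89-generic", "5.15.0-91-generic", "5.15.0-1042-nvidia"):
    print(release, possibly_affected(release))
```

On a node, the running release string can be obtained with `platform.release()` (or `uname -r`) and passed to `possibly_affected`. A True result only means the kernel falls in the range reported here, not that the error will necessarily occur.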

Best,

Tim

On 23.01.24 14:59, Stefan Fleischmann wrote:
> Hi!
>
> I'm seeing the same in our environment. My conclusion is that it is a
> regression in the Ubuntu 5.15 kernel, introduced with 5.15.0-90-generic.
> Last working kernel version is 5.15.0-89-generic. I have filed a bug
> report here: https://bugs.launchpad.net/bugs/2050098
>
> Please add yourself to the affected users in the bug report so it
> hopefully gets more attention.
>
> I've tested with newer kernels (6.5, 6.6 and 6.7) and the problem does
> not exist there. 6.5 is the latest HWE kernel for 22.04 and would be an
> option for now. Reverting to 5.15.0-89 would work as well, but I
> haven't looked into the security aspects of that.
>
> Cheers,
> Stefan
>
> On Mon, 22 Jan 2024 13:31:15 -0300
> cristobal.navarro.g at gmail.com wrote:
>
>> Hi Tim and community,
>> We have been seeing the same issue (cgroups apparently not working,
>> all GPUs visible to jobs) on a GPU-compute node (DGX A100) since a
>> full update (apt upgrade) a couple of days ago. Now whenever we launch
>> a job for that partition, we get the error message mentioned by Tim.
>> As a note, we have another custom GPU-compute node with L40s, on a
>> different partition, and that one works fine.
>> Before this error, we always had small differences in kernel version
>> between nodes, so I am not sure if this can be the problem.
>> Nevertheless, here is the info of our nodes as well.
>>
>> *[Problem node]* The DGX A100 node has this kernel
>> cnavarro@nodeGPU01:~$ uname -a
>> Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30
>> UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
>>
>> *[Functioning node]* The Custom GPU node (L40s) has this kernel
>> cnavarro@nodeGPU02:~$ uname -a
>> Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08
>> UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
>>
>> *And the login node* (slurmctld)
>> ➜  ~ uname -a
>> Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14
>> 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
>>
>> Any ideas what we should check?
>>
>> On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider <tim.schneider1 at
>> tu-darmstadt.de> wrote:
>>
>>> Hi,
>>>
>>> I am using SLURM 22.05.9 on a small compute cluster. Since I
>>> reinstalled two of our nodes, I get the following error when
>>> launching a job:
>>>
>>> slurmstepd: error: load_ebpf_prog: BPF load error (No space left on
>>> device). Please check your system limits (MEMLOCK).
>>>
>>> Also the cgroups do not seem to work properly anymore, as I am able
>>> to see all GPUs even if I do not request them, which is not the
>>> case on the other nodes.
>>>
>>> One difference I found between the updated nodes and the original
>>> nodes (both are Ubuntu 22.04) is the kernel version, which is
>>> "5.15.0-89-generic #99-Ubuntu SMP" on the functioning nodes and
>>> "5.15.0-91-generic #101-Ubuntu SMP" on the updated nodes. I could
>>> not figure out how to install exactly the older kernel version on the
>>> updated nodes, but I noticed that when I reinstall 5.15.0 with this
>>> tool: https://github.com/pimlie/ubuntu-mainline-kernel.sh, the
>>> error message disappears. However, once I do that, the network
>>> driver does not function properly anymore, so this does not seem to
>>> be a good solution.
>>>
>>> Has anyone seen this issue before or is there maybe something else I
>>> should take a look at? I am also happy to just find a workaround
>>> such that I can take these nodes back online.
>>>
>>> I appreciate any help!
>>>
>>> Thanks a lot in advance and best wishes,
>>>
>>> Tim
>>>
>>>
>>>