[slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).
Tim Schneider
tim.schneider1 at tu-darmstadt.de
Wed Jan 24 16:20:20 UTC 2024
Hi,
I just tested with 23.02.7-1 and the issue is gone. So it seems like the
patch got released.
Best,
Tim
On 1/24/24 16:55, Stefan Fleischmann wrote:
> On Wed, 24 Jan 2024 12:37:04 -0300 Cristóbal Navarro
> <cristobal.navarro.g at gmail.com> wrote:
>> Many thanks!
>> One question: do we have to apply this patch (and recompile Slurm,
>> I guess) only on the compute node with problems?
>> Also, I noticed the patch now appears as "obsolete"; is that OK?
> We have Slurm installed on an NFS share, so what I did was recompile
> it and then replace only the library lib/slurm/cgroup_v2.so. Good
> enough for now; I've been planning to update to 23.11 soon anyway.
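>
> In case it helps, roughly what that looked like on our side (the
> install prefix and patch file name below are just examples, and the
> configure options should match the original build):
>
> # in the Slurm source tree, with the patch from bug 17210 applied
> patch -p1 < cgroup_v2_bpf_log.patch
> ./configure --prefix=/opt/slurm && make -j$(nproc)
> # back up and swap only the cgroup/v2 plugin on the shared install
> # (libtool puts the freshly built plugin under .libs/)
> cp /opt/slurm/lib/slurm/cgroup_v2.so{,.bak}
> cp src/plugins/cgroup/v2/.libs/cgroup_v2.so /opt/slurm/lib/slurm/
> # restart slurmd on the affected nodes afterwards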
>
> I suppose it's marked as obsolete because the patch went into a
> release. According to the info in the bug report it should have been
> included in 23.02.4.
>
> Cheers,
> Stefan
>
>> On Wed, Jan 24, 2024 at 9:52 AM Stefan Fleischmann <sfle at kth.se>
>> wrote:
>>
>>> Turns out I was wrong: this is not a problem in the kernel at all.
>>> It's a known bug that is triggered by long bpf logs; see
>>> https://bugs.schedmd.com/show_bug.cgi?id=17210
>>>
>>> There is a patch included there.
>>>
>>> Cheers,
>>> Stefan
>>>
>>> On Tue, 23 Jan 2024 15:28:59 +0100 Stefan Fleischmann <sfle at kth.se>
>>> wrote:
>>>> I don't think there is much for SchedMD to do. As I said, since
>>>> it is working fine with newer kernels there doesn't seem to be
>>>> any breaking change in cgroup2 in general, only a regression
>>>> introduced in one of the latest 5.15 updates.
>>>>
>>>> If Slurm was doing something wrong with cgroup2, and it
>>>> accidentally worked until this recent change, then other kernel
>>>> versions should show the same behavior. But as far as I can tell
>>>> it still works just fine with newer kernels.
>>>>
>>>> Cheers,
>>>> Stefan
>>>>
>>>> On Tue, 23 Jan 2024 15:20:56 +0100
>>>> Tim Schneider <tim.schneider1 at tu-darmstadt.de> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have filed a bug report with SchedMD
>>>>> (https://bugs.schedmd.com/show_bug.cgi?id=18623), but their
>>>>> support told me they cannot invest time in this issue since I
>>>>> don't have a support contract. Maybe they will look into it
>>>>> once it affects more people or someone important enough.
>>>>>
>>>>> So far, I have resorted to using 5.15.0-89-generic, but I am
>>>>> also a bit concerned about the security aspect of this choice.
>>>>>
>>>>> Best,
>>>>>
>>>>> Tim
>>>>>
>>>>> On 23.01.24 14:59, Stefan Fleischmann wrote:
>>>>>> Hi!
>>>>>>
>>>>>> I'm seeing the same in our environment. My conclusion is that
>>>>>> it is a regression in the Ubuntu 5.15 kernel, introduced with
>>>>>> 5.15.0-90-generic. Last working kernel version is
>>>>>> 5.15.0-89-generic. I have filed a bug report here:
>>>>>> https://bugs.launchpad.net/bugs/2050098
>>>>>>
>>>>>> Please add yourself to the affected users in the bug report
>>>>>> so it hopefully gets more attention.
>>>>>>
>>>>>> I've tested with newer kernels (6.5, 6.6 and 6.7) and the
>>>>>> problem does not exist there. 6.5 is the latest HWE kernel
>>>>>> for 22.04 and would be an option for now. Reverting back to
>>>>>> 5.15.0-89 would work as well, but I haven't looked into the
>>>>>> security aspects of that.
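>>>>>>
>>>>>> For the kernel itself, something along these lines should work
>>>>>> on 22.04 (standard Ubuntu archive package names, assuming
>>>>>> 5.15.0-89 is still available there; pick one of the two):
>>>>>>
>>>>>> # move to the HWE kernel series (currently 6.5)
>>>>>> sudo apt install linux-generic-hwe-22.04
>>>>>> # or go back to the last working 5.15 kernel and hold it
>>>>>> # (add linux-headers-5.15.0-89-generic if you build DKMS
>>>>>> # modules such as the NVIDIA driver)
>>>>>> sudo apt install linux-image-5.15.0-89-generic \
>>>>>>     linux-modules-extra-5.15.0-89-generic
>>>>>> sudo apt-mark hold linux-image-5.15.0-89-generic
>>>>>> # then select that kernel in GRUB and reboot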
>>>>>>
>>>>>> Cheers,
>>>>>> Stefan
>>>>>>
>>>>>> On Mon, 22 Jan 2024 13:31:15 -0300
>>>>>> cristobal.navarro.g at gmail.com wrote:
>>>>>>
>>>>>>> Hi Tim and community,
>>>>>>> We have been seeing the same issue (cgroups apparently not
>>>>>>> working, jobs see all GPUs) on a GPU compute node (DGX A100)
>>>>>>> since a full update (apt upgrade) a couple of days ago. Now
>>>>>>> whenever we launch a job on that partition, we get the error
>>>>>>> message mentioned by Tim. As a note, we have another custom
>>>>>>> GPU compute node with L40s on a different partition, and
>>>>>>> that one works fine. Before this error, we always had small
>>>>>>> differences in kernel version between nodes, so I am not
>>>>>>> sure whether this can be the problem. Nevertheless, here is
>>>>>>> the info of our nodes as well.
>>>>>>>
>>>>>>> *[Problem node]* The DGX A100 node has this kernel:
>>>>>>> cnavarro at nodeGPU01:~$ uname -a
>>>>>>> Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15
>>>>>>> 20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>
>>>>>>> *[Functioning node]* The custom GPU node (L40s) has this
>>>>>>> kernel:
>>>>>>> cnavarro at nodeGPU02:~$ uname -a
>>>>>>> Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14
>>>>>>> 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>
>>>>>>> *And the login node* (slurmctld):
>>>>>>> ~$ uname -a
>>>>>>> Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue
>>>>>>> Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>>
>>>>>>> Any ideas what we should check?
>>>>>>>
>>>>>>> On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider <tim.schneider1
>>>>>>> at tu-darmstadt.de> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am using SLURM 22.05.9 on a small compute cluster. Since I
>>>>>>>> reinstalled two of our nodes, I get the following error when
>>>>>>>> launching a job:
>>>>>>>>
>>>>>>>> slurmstepd: error: load_ebpf_prog: BPF load error (No space
>>>>>>>> left on device). Please check your system limits (MEMLOCK).
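>>>>>>>>
>>>>>>>> For reference, the MEMLOCK limit the message refers to can
>>>>>>>> be checked on the running slurmd like this (assuming slurmd
>>>>>>>> runs under systemd; the drop-in path is just an example):
>>>>>>>>
>>>>>>>> # locked-memory limit of the running slurmd process
>>>>>>>> grep "locked memory" /proc/$(pidof slurmd)/limits
>>>>>>>> # if it is low, raise it via a systemd drop-in, e.g.
>>>>>>>> # /etc/systemd/system/slurmd.service.d/memlock.conf with
>>>>>>>> #   [Service]
>>>>>>>> #   LimitMEMLOCK=infinity
>>>>>>>> # then: systemctl daemon-reload && systemctl restart slurmd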
>>>>>>>>
>>>>>>>> Also the cgroups do not seem to work properly anymore, as I
>>>>>>>> am able to see all GPUs even if I do not request them,
>>>>>>>> which is not the case on the other nodes.
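>>>>>>>>
>>>>>>>> A quick way to check the confinement (assuming the GRES is
>>>>>>>> named gpu on these nodes):
>>>>>>>>
>>>>>>>> # with working cgroup device confinement this lists one GPU
>>>>>>>> srun --gres=gpu:1 nvidia-smi -L
>>>>>>>> # while a job that requests no GPU should not see any
>>>>>>>> srun nvidia-smi -L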
>>>>>>>>
>>>>>>>> One difference I found between the updated nodes and the
>>>>>>>> original nodes (both are Ubuntu 22.04) is the kernel
>>>>>>>> version, which is "5.15.0-89-generic #99-Ubuntu SMP" on the
>>>>>>>> functioning nodes and "5.15.0-91-generic #101-Ubuntu SMP"
>>>>>>>> on the updated nodes. I could not figure out how to install
>>>>>>>> that exact older kernel version on the updated nodes, but I
>>>>>>>> noticed that when I install the mainline 5.15.0 kernel with
>>>>>>>> this tool:
>>>>>>>> https://github.com/pimlie/ubuntu-mainline-kernel.sh, the
>>>>>>>> error message disappears. However, once I do that, the
>>>>>>>> network driver no longer functions properly, so this does
>>>>>>>> not seem to be a good solution.
>>>>>>>>
>>>>>>>> Has anyone seen this issue before, or is there something
>>>>>>>> else I should take a look at? I would also be happy with
>>>>>>>> just a workaround so that I can take these nodes back
>>>>>>>> online.
>>>>>>>>
>>>>>>>> I appreciate any help!
>>>>>>>>
>>>>>>>> Thanks a lot in advance and best wishes,
>>>>>>>>
>>>>>>>> Tim
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>
>>>
>>