[slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

Cristóbal Navarro cristobal.navarro.g at gmail.com
Wed Jan 24 16:33:03 UTC 2024


Hi,
A few minutes ago I recompiled the cgroup_v2 plugin from Slurm with the fix
included, replaced the old cgroup_v2.{a,la,so} files with the new ones in
/usr/lib/slurm, and now jobs work properly on that node.
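For reference, the steps were roughly the following (build-tree layout and
install path are from our setup, so adjust as needed):

   # in a configured Slurm source tree matching the installed version,
   # with the patch from bug 17210 applied, rebuild only the cgroup/v2 plugin
   cd src/plugins/cgroup/v2
   make
   # back up the old plugin on the affected node and install the new one
   cp /usr/lib/slurm/cgroup_v2.so /usr/lib/slurm/cgroup_v2.so.bak
   cp .libs/cgroup_v2.so /usr/lib/slurm/
   systemctl restart slurmd
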
Many thanks for all the help. Indeed, we will update to the most recent
23.xx or 24.xx release in the coming months.

On Wed, Jan 24, 2024 at 1:20 PM Tim Schneider <
tim.schneider1 at tu-darmstadt.de> wrote:

> Hi,
>
> I just tested with 23.02.7-1 and the issue is gone. So it seems like the
> patch got released.
>
> Best,
>
> Tim
>
> On 1/24/24 16:55, Stefan Fleischmann wrote:
> > On Wed, 24 Jan 2024 12:37:04 -0300 Cristóbal Navarro
> > <cristobal.navarro.g at gmail.com> wrote:
> >> Many thanks
> >> One question: do we have to apply this patch (and recompile Slurm, I
> >> guess) only on the compute node with problems?
> >> Also, I noticed the patch now appears as "obsolete"; is that OK?
> > We have Slurm installed on an NFS share, so what I did was to recompile
> > it and then only replace the library lib/slurm/cgroup_v2.so. Good
> > enough for now; I've been planning to update to 23.11 soon anyway.
> >
> > I suppose it's marked as obsolete because the patch went into a
> > release. According to the info in the bug report it should have been
> > included in 23.02.4.
> >
> > Cheers,
> > Stefan
> >
> >> On Wed, Jan 24, 2024 at 9:52 AM Stefan Fleischmann <sfle at kth.se>
> >> wrote:
> >>
> >>> Turns out I was wrong; this is not a problem in the kernel at all.
> >>> It's a known bug that is triggered by long bpf logs, see here:
> >>>   https://bugs.schedmd.com/show_bug.cgi?id=17210
> >>>
> >>> There is a patch included there.
> >>>
> >>> Cheers,
> >>> Stefan
> >>>
> >>> On Tue, 23 Jan 2024 15:28:59 +0100 Stefan Fleischmann <sfle at kth.se>
> >>> wrote:
> >>>> I don't think there is much for SchedMD to do. As I said, since it
> >>>> is working fine with newer kernels there doesn't seem to be any
> >>>> breaking change in cgroup2 in general, only a regression
> >>>> introduced in one of the latest updates to 5.15.
> >>>>
> >>>> If Slurm was doing something wrong with cgroup2, and it
> >>>> accidentally worked until this recent change, then other kernel
> >>>> versions should show the same behavior. But as far as I can tell
> >>>> it still works just fine with newer kernels.
> >>>>
> >>>> Cheers,
> >>>> Stefan
> >>>>
> >>>> On Tue, 23 Jan 2024 15:20:56 +0100
> >>>> Tim Schneider <tim.schneider1 at tu-darmstadt.de> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I have filed a bug report with SchedMD
> >>>>> (https://bugs.schedmd.com/show_bug.cgi?id=18623), but support
> >>>>> told me they cannot invest time in this issue since I don't
> >>>>> have a support contract. Maybe they will look into it once it
> >>>>> affects more people or someone important enough.
> >>>>>
> >>>>> So far, I have resorted to using 5.15.0-89-generic, but I am
> >>>>> also a bit concerned about the security aspect of this choice.
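> >>>>>
> >>>>> (In case it is useful to anyone doing the same, holding the working
> >>>>> kernel so apt does not pull in -90/-91 again should be roughly the
> >>>>> following; the exact package names are assumptions, check what is
> >>>>> actually installed on your nodes:)
> >>>>>
> >>>>>    sudo apt-mark hold linux-image-5.15.0-89-generic \
> >>>>>        linux-headers-5.15.0-89-generic \
> >>>>>        linux-modules-5.15.0-89-generic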
> >>>>>
> >>>>> Best,
> >>>>>
> >>>>> Tim
> >>>>>
> >>>>> On 23.01.24 14:59, Stefan Fleischmann wrote:
> >>>>>> Hi!
> >>>>>>
> >>>>>> I'm seeing the same in our environment. My conclusion is that
> >>>>>> it is a regression in the Ubuntu 5.15 kernel, introduced with
> >>>>>> 5.15.0-90-generic. Last working kernel version is
> >>>>>> 5.15.0-89-generic. I have filed a bug report here:
> >>>>>> https://bugs.launchpad.net/bugs/2050098
> >>>>>>
> >>>>>> Please add yourself to the affected users in the bug report
> >>>>>> so it hopefully gets more attention.
> >>>>>>
> >>>>>> I've tested with newer kernels (6.5, 6.6 and 6.7) and the
> >>>>>> problem does not exist there. 6.5 is the latest HWE kernel
> >>>>>> for 22.04 and would be an option for now. Reverting to
> >>>>>> 5.15.0-89 would work as well, but I haven't looked into the
> >>>>>> security aspects of that.
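> >>>>>>
> >>>>>> (For reference, pulling in the HWE kernel on 22.04 should be
> >>>>>> something like the lines below; I have not verified this on the
> >>>>>> affected nodes, so treat it as a sketch:)
> >>>>>>
> >>>>>>    sudo apt install --install-recommends linux-generic-hwe-22.04
> >>>>>>    sudo reboot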
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Stefan
> >>>>>>
> >>>>>> On Mon, 22 Jan 2024 13:31:15 -0300
> >>>>>> cristobal.navarro.g at gmail.com wrote:
> >>>>>>
> >>>>>>> Hi Tim and community,
> >>>>>>> We started having the same issue (cgroups not working, it
> >>>>>>> seems; jobs see all GPUs) on a GPU compute node (DGX A100) a
> >>>>>>> couple of days ago after a full update (apt upgrade). Now
> >>>>>>> whenever we launch a job for that partition, we get the error
> >>>>>>> message mentioned by Tim. As a note, we have another custom
> >>>>>>> GPU compute node with L40s, on a different partition, and that
> >>>>>>> one works fine. Before this error, we always had small
> >>>>>>> differences in kernel version between nodes, so I am not sure
> >>>>>>> whether that could be the problem. Nevertheless, here is the
> >>>>>>> info for our nodes as well.
> >>>>>>>
> >>>>>>> *[Problem node]* The DGX A100 node has this kernel
> >>>>>>> cnavarro at nodeGPU01:~$ uname -a
> >>>>>>> Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15
> >>>>>>> 20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
> >>>>>>>
> >>>>>>> *[Functioning node]* The Custom GPU node (L40s) has this
> >>>>>>> kernel cnavarro at nodeGPU02:~$ uname -a
> >>>>>>> Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14
> >>>>>>> 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
> >>>>>>>
> >>>>>>> *And the login node* (slurmctld)
> >>>>>>> ~$ uname -a
> >>>>>>> Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue
> >>>>>>> Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
> >>>>>>>
> >>>>>>> Any ideas what we should check?
> >>>>>>>
> >>>>>>> On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider <tim.schneider1
> >>>>>>> at tu-darmstadt.de> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I am using SLURM 22.05.9 on a small compute cluster. Since I
> >>>>>>>> reinstalled two of our nodes, I get the following error when
> >>>>>>>> launching a job:
> >>>>>>>>
> >>>>>>>> slurmstepd: error: load_ebpf_prog: BPF load error (No space
> >>>>>>>> left on device). Please check your system limits (MEMLOCK).
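> >>>>>>>>
> >>>>>>>> (In case it is relevant: a quick way to check the MEMLOCK limit
> >>>>>>>> slurmd actually runs with, assuming it is started via systemd,
> >>>>>>>> is roughly the following.)
> >>>>>>>>
> >>>>>>>>    # locked-memory limit of the running slurmd process
> >>>>>>>>    grep -i "locked memory" /proc/$(pidof slurmd)/limits
> >>>>>>>>    # limit granted by the systemd unit
> >>>>>>>>    systemctl show slurmd -p LimitMEMLOCK
> >>>>>>>>    # raise it via an override if needed (LimitMEMLOCK=infinity
> >>>>>>>>    # in the [Service] section), then restart slurmd
> >>>>>>>>    sudo systemctl edit slurmd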
> >>>>>>>>
> >>>>>>>> Also, the cgroups do not seem to work properly anymore: I am
> >>>>>>>> able to see all GPUs even if I do not request them, which is
> >>>>>>>> not the case on the other nodes.
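> >>>>>>>>
> >>>>>>>> (The check I mean is along these lines; NODENAME is a
> >>>>>>>> placeholder and the gres name is assumed to match our gres.conf:)
> >>>>>>>>
> >>>>>>>>    # request a single GPU on the affected node; with working
> >>>>>>>>    # cgroup confinement only one device should be listed
> >>>>>>>>    srun -w NODENAME --gres=gpu:1 nvidia-smi -L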
> >>>>>>>>
> >>>>>>>> One difference I found between the updated nodes and the
> >>>>>>>> original nodes (both are Ubuntu 22.04) is the kernel
> >>>>>>>> version, which is "5.15.0-89-generic #99-Ubuntu SMP" on the
> >>>>>>>> functioning nodes and "5.15.0-91-generic #101-Ubuntu SMP"
> >>>>>>>> on the updated nodes. I could not figure out how to install
> >>>>>>>> the exact first kernel version on the updated nodes, but I
> >>>>>>>> noticed that when I reinstall 5.15.0 with this tool:
> >>>>>>>> https://github.com/pimlie/ubuntu-mainline-kernel.sh, the
> >>>>>>>> error message disappears. However, once I do that, the
> >>>>>>>> network driver does not function properly anymore, so this
> >>>>>>>> does not seem to be a good solution.
> >>>>>>>>
> >>>>>>>> Has anyone seen this issue before or is there maybe
> >>>>>>>> something else I should take a look at? I am also happy to
> >>>>>>>> just find a workaround such that I can take these nodes
> >>>>>>>> back online.
> >>>>>>>>
> >>>>>>>> I appreciate any help!
> >>>>>>>>
> >>>>>>>> Thanks a lot in advance and best wishes,
> >>>>>>>>
> >>>>>>>> Tim
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>
> >>>
> >>
>


-- 
Cristóbal A. Navarro