[slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

Wed Jan 24 16:15:39 UTC 2024

SchedMD has accepted the patch, so you don't need it if you're running a release that includes the fix. However, it looks like they haven't released that version yet. The patch is to slurmd, so you don't need it on the controller. If only some systems have the problem, you can apply it just on those. But if this is Ubuntu, then unless you prevent kernel upgrades, any system on 5.15 will eventually end up with a kernel that has the problem, so I'd put it on all systems with 5.15. I can't comment on other releases.
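
As a rough sketch of the "all systems with 5.15" rule above, something like the following could be run on each node to tell whether it is on a 5.15 kernel. The function name is_515_kernel is made up for illustration. It only looks at the release string from uname -r, so it cannot tell whether a particular 5.15 build already has the problem.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel release string
# (default: the running kernel, from uname -r) is in the 5.15 series.
is_515_kernel() {
    release="${1:-$(uname -r)}"
    case "$release" in
        5.15.*) return 0 ;;
        *)      return 1 ;;
    esac
}

if is_515_kernel; then
    echo "5.15 kernel: candidate for the patched slurmd"
else
    echo "not a 5.15 kernel: no statement made about this release"
fi
```

Both release strings quoted later in this thread, 5.15.0-91-generic and 5.15.0-1042-nvidia, match the 5.15.* pattern.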

Apply the patch to the source tree and run "make install", assuming you've already built from source. If you haven't built from source before, you will have to configure it first, and the configuration may differ from site to site. Here's what we do:

Put the source in  /usr/src/{{slurm_version}}

apt install libssl-dev libbz2-dev zlib1g-dev pkg-config xz-utils libhwloc-dev libfreeipmi-dev liblua5.3-dev libmariadb-dev libpam0g-dev libnuma-dev librrd-dev libgtk2.0-dev libhttp-parser-dev libjson-c-dev libyaml-dev libjwt-dev liblz4-dev libcurl4-gnutls-dev libipmimonitoring-dev libpmix-dev mariadb-client

umask 022; cd /usr/src/{{slurm_version}}; ./configure --prefix=/usr/local/{{slurm_version}} --with-nvml=/usr/local/cuda-11.1 --with-pmix=/usr/lib/x86_64-linux-gnu/pmix; make; make install; /usr/sbin/ldconfig
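
For readability, here is a sketch of the same build steps as a small shell function that prints them one per line, so they can be reviewed before anything runs. The function name slurm_build_steps and the directory name slurm-23.02.3 are assumptions for illustration. The {{slurm_version}} placeholder is replaced by a shell variable, and the --with-nvml and --with-pmix paths are copied from the command above and will differ on other sites.

```shell
#!/bin/sh
# Hypothetical helper: print the build steps for one Slurm source directory.
# $1 is the directory name used under /usr/src and /usr/local
# (for example "slurm-23.02.3"; that exact name is an assumption).
slurm_build_steps() {
    slurm_version="$1"
    cat <<EOF
umask 022
cd /usr/src/${slurm_version}
./configure --prefix=/usr/local/${slurm_version} --with-nvml=/usr/local/cuda-11.1 --with-pmix=/usr/lib/x86_64-linux-gnu/pmix
make
make install
/usr/sbin/ldconfig
EOF
}

# Only prints the steps; nothing is built here.
slurm_build_steps "slurm-23.02.3"
```

To actually run the steps, the output can be piped to "sh -e". Note one difference from the one-liner above: with ";" every step runs even if an earlier one failed, while "sh -e" stops at the first failure.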

That ends up with binaries in /usr/local/{{slurm_version}}/bin. They are symlinked into /usr/local/bin.
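
The symlinking step could look roughly like this. It is only a sketch: link_slurm_bins is a hypothetical helper, not something shipped with Slurm, and the message does not say how the links are actually created.

```shell
#!/bin/sh
# Hypothetical helper: symlink every executable regular file from a
# versioned bin directory into a shared bin directory.
#   $1 = source directory, e.g. /usr/local/slurm-23.02.3/bin (assumed name)
#   $2 = destination directory, e.g. /usr/local/bin
link_slurm_bins() {
    src_dir="$1"
    dest_dir="$2"
    for bin in "$src_dir"/*; do
        # Skip anything that is not an executable regular file
        # (this also covers an empty directory, where the glob stays literal).
        [ -f "$bin" ] && [ -x "$bin" ] || continue
        # -f replaces an existing link, so re-running after an upgrade
        # repoints the names at the new version.
        ln -sf "$bin" "$dest_dir/$(basename "$bin")"
    done
}

# Example call (not executed here):
#   link_slurm_bins /usr/local/slurm-23.02.3/bin /usr/local/bin
```

Because the links are replaced in place, switching versions is a matter of re-running the helper against the new versioned directory.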

This is for Ubuntu 22.04 and Slurm 23.02.3. The exact dependencies seem to change with both Ubuntu and Slurm releases. (I have an Ansible role that does a full build of all the pieces, though it may not apply to you because we use Kerberos, so I have to make sure the Kerberos credentials are captured when you run srun or sbatch and passed to the system where the job runs.)

________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Cristóbal Navarro <cristobal.navarro.g at gmail.com>
Sent: Wednesday, January 24, 2024 10:37 AM
To: Stefan Fleischmann <sfle at kth.se>
Cc: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

Many thanks
One question: do we have to apply this patch (and recompile Slurm, I guess) only on the compute node with problems?
Also, I noticed the patch now appears as "obsolete". Is that OK?

On Wed, Jan 24, 2024 at 9:52 AM Stefan Fleischmann <sfle at kth.se> wrote:
Turns out I was wrong, this is not a problem in the kernel at all. It's
a known bug that is triggered by long bpf logs, see here
 https://bugs.schedmd.com/show_bug.cgi?id=17210

There is a patch included there.

Cheers,
Stefan

On Tue, 23 Jan 2024 15:28:59 +0100 Stefan Fleischmann <sfle at kth.se>
wrote:
> I don't think there is much for SchedMD to do. As I said since it is
> working fine with newer kernels there doesn't seem to be any breaking
> change in cgroup2 in general, but only a regression introduced in one
> of the latest updates in 5.15.
>
> If Slurm was doing something wrong with cgroup2, and it accidentally
> worked until this recent change, then other kernel versions should
> show the same behavior. But as far as I can tell it still works just
> fine with newer kernels.
>
> Cheers,
> Stefan
>
> On Tue, 23 Jan 2024 15:20:56 +0100
> Tim Schneider <tim.schneider1 at tu-darmstadt.de> wrote:
>
> > Hi,
> >
> > I have filed a bug report with SchedMD
> > (https://bugs.schedmd.com/show_bug.cgi?id=18623), but the support
> > told me they cannot invest time in this issue since I don't have a
> > support contract. Maybe they will look into it once it affects more
> > people or someone important enough.
> >
> > So far, I have resorted to using 5.15.0-89-generic, but I am also a
> > bit concerned about the security aspect of this choice.
> >
> > Best,
> >
> > Tim
> >
> > On 23.01.24 14:59, Stefan Fleischmann wrote:
> > > Hi!
> > >
> > > I'm seeing the same in our environment. My conclusion is that it
> > > is a regression in the Ubuntu 5.15 kernel, introduced with
> > > 5.15.0-90-generic. Last working kernel version is
> > > 5.15.0-89-generic. I have filed a bug report here:
> > > https://bugs.launchpad.net/bugs/2050098
> > >
> > > Please add yourself to the affected users in the bug report so it
> > > hopefully gets more attention.
> > >
> > > I've tested with newer kernels (6.5, 6.6 and 6.7) and the problem
> > > does not exist there. 6.5 is the latest hwe kernel for 22.04 and
> > > would be an option for now. Reverting back to 5.15.0-89 would work
> > > as well, but I haven't looked into the security aspects of that.
> > >
> > > Cheers,
> > > Stefan
> > >
> > > On Mon, 22 Jan 2024 13:31:15 -0300
> > > > cristobal.navarro.g at gmail.com wrote:
> > >
> > >> Hi Tim and community,
> > >> We are currently having the same issue (cgroups not working it
> > >> seems, showing all GPUs on jobs) on a GPU-compute node (DGX A100)
> > >> a couple of days ago after a full update (apt upgrade). Now
> > >> whenever we launch a job for that partition, we get the error
> > >> message mentioned by Tim. As a note, we have another custom
> > >> GPU-compute node with L40s, on a different partition, and that
> > >> one works fine. Before this error, we always had small
> > >> differences in kernel version between nodes, so I am not sure if
> > >> this can be the problem. Nevertheless, here is the info of our
> > >> nodes as well.
> > >>
> > >> *[Problem node]* The DGX A100 node has this kernel
> > >> cnavarro at nodeGPU01:~$ uname -a
> > >> Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15
> > >> 20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
> > >>
> > >> *[Functioning node]* The Custom GPU node (L40s) has this kernel
> > >> cnavarro at nodeGPU02:~$ uname -a
> > >> Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14
> > >> 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
> > >>
> > >> *And the login node *(slurmctld)
> > > >> ➜  ~ uname -a
> > >> Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14
> > >> 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
> > >>
> > >> Any ideas what we should check?
> > >>
> > > >> On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider <tim.schneider1 at
> > > >> tu-darmstadt.de> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I am using SLURM 22.05.9 on a small compute cluster. Since I
> > >>> reinstalled two of our nodes, I get the following error when
> > >>> launching a job:
> > >>>
> > >>> slurmstepd: error: load_ebpf_prog: BPF load error (No space left
> > >>> on device). Please check your system limits (MEMLOCK).
> > >>>
> > >>> Also the cgroups do not seem to work properly anymore, as I am
> > >>> able to see all GPUs even if I do not request them, which is not
> > >>> the case on the other nodes.
> > >>>
> > >>> One difference I found between the updated nodes and the
> > >>> original nodes (both are Ubuntu 22.04) is the kernel version,
> > >>> which is "5.15.0-89-generic #99-Ubuntu SMP" on the functioning
> > >>> nodes and "5.15.0-91-generic #101-Ubuntu SMP" on the updated
> > >>> nodes. I could not figure out how to install the exact first
> > >>> kernel version on the updated nodes, but I noticed that when I
> > >>> reinstall 5.15.0 with this tool:
> > >>> https://github.com/pimlie/ubuntu-mainline-kernel.sh, the error
> > >>> message disappears. However, once I do that, the network driver
> > >>> does not function properly anymore, so this does not seem to be
> > >>> a good solution.
> > >>>
> > >>> Has anyone seen this issue before or is there maybe something
> > >>> else I should take a look at? I am also happy to just find a
> > >>> workaround such that I can take these nodes back online.
> > >>>
> > >>> I appreciate any help!
> > >>>
> > >>> Thanks a lot in advance and best wishes,
> > >>>
> > >>> Tim
> > >>>
> > >>>
> > >>>
>

--
Cristóbal A. Navarro