[slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).
Cristóbal Navarro
cristobal.navarro.g at gmail.com
Mon Jan 22 16:31:15 UTC 2024
Hi Tim and community,
We started having the same issue (cgroups not working, it seems: jobs see
all GPUs) on a GPU-compute node (DGX A100) a couple of days ago, after a
full update (apt upgrade). Now whenever we launch a job on that partition,
we get the error message mentioned by Tim. As a note, we have another
custom GPU-compute node with L40S GPUs on a different partition, and that
one works fine.
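In case it helps to reproduce the symptom: with working device cgroups, a
job that requests a single GPU should see only that GPU. A minimal check
(the partition name is a placeholder) would be something like:

$ srun -p <partition> --gres=gpu:1 nvidia-smi -L

On the DGX A100 node, jobs currently see every GPU instead of only the
requested one.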
Even before this error, we always had small differences in kernel version
between nodes, so I am not sure whether that can be the problem.
Nevertheless, here is the kernel info of our nodes as well.
*[Problem node]* The DGX A100 node has this kernel:
cnavarro at nodeGPU01:~$ uname -a
Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30 UTC
2023 x86_64 x86_64 x86_64 GNU/Linux
*[Functioning node]* The custom GPU node (L40S) has this kernel:
cnavarro at nodeGPU02:~$ uname -a
Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC
2023 x86_64 x86_64 x86_64 GNU/Linux
*And the login node* (slurmctld):
➜ ~ uname -a
Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08
UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Any ideas what we should check?
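For reference, the only checks I know of so far (sketches, assuming slurmd
runs as a systemd service named "slurmd", as in the standard packages):
first, the locked-memory limit that the error message points at, which can
be inspected and, if it is low, raised with a unit override:

$ grep 'locked memory' /proc/$(pidof slurmd)/limits
$ sudo systemctl edit slurmd   # then add under [Service]: LimitMEMLOCK=infinity
$ sudo systemctl restart slurmd

And second, which cgroup hierarchy the node is actually running, since the
eBPF device controller is a cgroup v2 mechanism:

$ stat -fc %T /sys/fs/cgroup/   # cgroup2fs = unified (v2), tmpfs = legacy/hybrid (v1)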
On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider <tim.schneider1 at tu-darmstadt.de>
wrote:
> Hi,
>
> I am using SLURM 22.05.9 on a small compute cluster. Since I reinstalled
> two of our nodes, I get the following error when launching a job:
>
> slurmstepd: error: load_ebpf_prog: BPF load error (No space left on
> device). Please check your system limits (MEMLOCK).
>
> Also, cgroups do not seem to work properly anymore: I can see all GPUs
> even when I do not request them, which is not the case on the other
> nodes.
>
> One difference I found between the updated nodes and the original nodes
> (both Ubuntu 22.04) is the kernel version, which is "5.15.0-89-generic
> #99-Ubuntu SMP" on the functioning nodes and "5.15.0-91-generic
> #101-Ubuntu SMP" on the updated nodes. I could not figure out how to
> install that exact earlier kernel version on the updated nodes, but I
> noticed that when I install the mainline 5.15.0 kernel with this tool:
> https://github.com/pimlie/ubuntu-mainline-kernel.sh, the error message
> disappears. However, once I do that, the network driver no longer
> functions properly, so this does not seem to be a good solution.
>
> Has anyone seen this issue before, or is there something else I should
> take a look at? I would also be happy with a workaround that lets me
> bring these nodes back online.
>
> I appreciate any help!
>
> Thanks a lot in advance and best wishes,
>
> Tim
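Regarding Tim's point about installing the exact earlier kernel: on Ubuntu
22.04 the older versioned kernel packages normally remain in the archive,
so (a sketch, assuming the -89 packages are still available) something
like this should work on the generic-kernel nodes:

$ sudo apt-get install linux-image-5.15.0-89-generic \
    linux-modules-5.15.0-89-generic linux-modules-extra-5.15.0-89-generic
$ sudo apt-mark hold linux-generic linux-image-generic linux-headers-generic

The apt-mark hold keeps "apt upgrade" from pulling in a newer kernel, and
the older kernel can then be selected from the GRUB menu at boot. I have
not verified this on the DGX, which uses NVIDIA's own kernel packages
(5.15.0-*-nvidia), so treat it as untested.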
--
Cristóbal A. Navarro