Hi,
I just tested with 23.02.7-1 and the issue is gone. So it seems like the patch got released.
Best,
Tim
On 1/24/24 16:55, Stefan Fleischmann wrote:
On Wed, 24 Jan 2024 12:37:04 -0300 Cristóbal Navarro cristobal.navarro.g@gmail.com wrote:
Many thanks. One question: do we have to apply this patch (and recompile Slurm, I guess) only on the compute node with problems? Also, I noticed the patch now appears as "obsolete"; is that OK?
We have Slurm installed on an NFS share, so I recompiled it and then replaced only the library lib/slurm/cgroup_v2.so. Good enough for now; I've been planning to update to 23.11 soon anyway.
I suppose it's marked as obsolete because the patch went into a release. According to the info in the bug report it should have been included in 23.02.4.
Cheers, Stefan
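The version reasoning above can be sketched as a small check. This is a minimal Python sketch, assuming the fix landed in 23.02.4 as the bug report suggests and that version strings look like `23.02.7-1`. The function names and the `FIXED_IN` constant are hypothetical, and the check says nothing about whether older release branches (such as 22.05) ever received the fix.

```python
import re

# Assumption: per the bug report, the fix should be in 23.02.4 and later.
FIXED_IN = (23, 2, 4)


def parse_slurm_version(text):
    """Turn a string such as '23.02.7-1' into a tuple like (23, 2, 7)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised Slurm version: {text!r}")
    return tuple(int(part) for part in match.groups())


def likely_has_fix(text):
    """Return True if the version is at or above the assumed fixed release."""
    return parse_slurm_version(text) >= FIXED_IN
```

With these assumptions, `likely_has_fix("23.02.7-1")` is true, which matches Tim's test, while `likely_has_fix("22.05.9")` is false.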
On Wed, Jan 24, 2024 at 9:52 AM Stefan Fleischmann sfle@kth.se wrote:
Turns out I was wrong: this is not a problem in the kernel at all. It's a known bug that is triggered by long BPF logs; see https://bugs.schedmd.com/show_bug.cgi?id=17210
There is a patch included there.
Cheers, Stefan
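To see which nodes hit this bug, one option is to scan slurmd output for the error quoted in the original report. A minimal Python sketch, assuming the message text appears exactly as Tim posted it; the function name and the sample log lines are hypothetical.

```python
# Marker copied from the error message quoted in this thread.
BPF_ERROR_MARKER = "load_ebpf_prog: BPF load error"


def find_bpf_load_errors(log_lines):
    """Return the log lines that contain the BPF load error."""
    return [line.rstrip("\n") for line in log_lines if BPF_ERROR_MARKER in line]


if __name__ == "__main__":
    # Hypothetical sample; in practice, read the slurmd log of the node.
    sample = [
        "slurmstepd: error: load_ebpf_prog: BPF load error (No space left "
        "on device). Please check your system limits (MEMLOCK).\n",
        "some unrelated log line\n",
    ]
    for hit in find_bpf_load_errors(sample):
        print(hit)
```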
On Tue, 23 Jan 2024 15:28:59 +0100 Stefan Fleischmann sfle@kth.se wrote:
I don't think there is much for SchedMD to do. As I said, since it is working fine with newer kernels, there doesn't seem to be a breaking change in cgroup2 in general, only a regression introduced in one of the latest updates to 5.15.
If Slurm was doing something wrong with cgroup2, and it accidentally worked until this recent change, then other kernel versions should show the same behavior. But as far as I can tell it still works just fine with newer kernels.
Cheers, Stefan
On Tue, 23 Jan 2024 15:20:56 +0100 Tim Schneider tim.schneider1@tu-darmstadt.de wrote:
Hi,
I have filed a bug report with SchedMD (https://bugs.schedmd.com/show_bug.cgi?id=18623), but support told me they cannot invest time in this issue since I don't have a support contract. Maybe they will look into it once it affects more people or someone important enough.
So far, I have resorted to using 5.15.0-89-generic, but I am also a bit concerned about the security aspect of this choice.
Best,
Tim
On 23.01.24 14:59, Stefan Fleischmann wrote:
Hi!
I'm seeing the same in our environment. My conclusion is that it is a regression in the Ubuntu 5.15 kernel, introduced with 5.15.0-90-generic. Last working kernel version is 5.15.0-89-generic. I have filed a bug report here: https://bugs.launchpad.net/bugs/2050098
Please add yourself to the affected users in the bug report so it hopefully gets more attention.
I've tested with newer kernels (6.5, 6.6 and 6.7) and the problem does not exist there. 6.5 is the latest hwe kernel for 22.04 and would be an option for now. Reverting back to 5.15.0-89 would work as well, but I haven't looked into the security aspects of that.
Cheers, Stefan
On Mon, 22 Jan 2024 13:31:15 -0300 cristobal.navarro.g@gmail.com wrote:
Hi Tim and community,
We are currently having the same issue (cgroups not working it seems, showing all GPUs on jobs) on a GPU-compute node (DGX A100) a couple of days ago after a full update (apt upgrade). Now whenever we launch a job for that partition, we get the error message mentioned by Tim. As a note, we have another custom GPU-compute node with L40s, on a different partition, and that one works fine. Before this error, we always had small differences in kernel version between nodes, so I am not sure if this can be the problem. Nevertheless, here is the info of our nodes as well.
*[Problem node]* The DGX A100 node has this kernel
cnavarro@nodeGPU01:~$ uname -a
Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
*[Functioning node]* The Custom GPU node (L40s) has this kernel
cnavarro@nodeGPU02:~$ uname -a
Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
*And the login node* (slurmctld)
~ uname -a
Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Any ideas what we should check?
On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider tim.schneider1@tu-darmstadt.de wrote:
Hi,
I am using SLURM 22.05.9 on a small compute cluster. Since I reinstalled two of our nodes, I get the following error when launching a job:
slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).
Also the cgroups do not seem to work properly anymore, as I am able to see all GPUs even if I do not request them, which is not the case on the other nodes.
One difference I found between the updated nodes and the original nodes (both are Ubuntu 22.04) is the kernel version, which is "5.15.0-89-generic #99-Ubuntu SMP" on the functioning nodes and "5.15.0-91-generic #101-Ubuntu SMP" on the updated nodes. I could not figure out how to install the exact first kernel version on the updated nodes, but I noticed that when I reinstall 5.15.0 with this tool: https://github.com/pimlie/ubuntu-mainline-kernel.sh, the error message disappears. However, once I do that, the network driver does not function properly anymore, so this does not seem to be a good solution.
Has anyone seen this issue before or is there maybe something else I should take a look at? I am also happy to just find a workaround such that I can take these nodes back online.
I appreciate any help!
Thanks a lot in advance and best wishes,
Tim
Hi,
A few minutes ago I recompiled the cgroup_v2 plugin from Slurm with the fix included and replaced the old cgroup_v2.{a,la,so} files in /usr/lib/slurm with the new ones, and now jobs work properly on that node. Many thanks for all the help. Indeed, in a few months we will update to the most recent 23.xx or 24.xx.
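The file swap described above can be sketched with a backup step, so the old plugin is easy to restore. This is a minimal Python sketch covering only the copy itself: the function name and the `.bak` suffix are hypothetical, the paths are whatever your build and install use (for example a rebuilt `cgroup_v2.so` and the copy under `/usr/lib/slurm`), and building the patched plugin is not covered.

```python
import shutil
from pathlib import Path


def replace_plugin(rebuilt, installed):
    """Copy a rebuilt plugin over the installed one, keeping a backup.

    Returns the path of the backup file.
    """
    rebuilt = Path(rebuilt)
    installed = Path(installed)
    backup = installed.with_name(installed.name + ".bak")
    shutil.copy2(installed, backup)   # keep the old plugin for rollback
    shutil.copy2(rebuilt, installed)  # put the patched build in place
    return backup
```

Keeping the backup next to the installed file means a rollback is a single copy in the other direction.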