Hi Tim and community,
We started seeing the same issue (cgroups apparently not being enforced, so jobs see all GPUs) on a GPU compute node (DGX A100) a couple of days ago, after a full update (apt upgrade). Now whenever we launch a job on that partition, we get the error message Tim mentioned. As a note, we have another custom GPU compute node with L40S GPUs, on a different partition, and that one works fine.
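In case it helps narrow things down, one thing that may be worth comparing between the two nodes (an assumption on our side: a full apt upgrade can move Ubuntu to the cgroup v2 unified hierarchy, and Slurm only enforces device constraints there if it was built with cgroup/v2 support) is which cgroup hierarchy each node mounts:

$ stat -fc %T /sys/fs/cgroup/
# "cgroup2fs" means the unified v2 hierarchy; "tmpfs" means the legacy v1 layout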
Even before this error we always had small kernel-version differences between nodes, so I am not sure that is the problem. Nevertheless, here is the kernel info for our nodes as well.
[Problem node] The DGX A100 node has this kernel:
cnavarro@nodeGPU01:~$ uname -a
Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
[Functioning node] The custom GPU node (L40S) has this kernel:
cnavarro@nodeGPU02:~$ uname -a
Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
And the login node (slurmctld):
➜ ~ uname -a
Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
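For completeness, these are the generic Slurm checks we can run (nothing here is specific to our setup; adjust the slurmd unit name to whatever your systemd service is called):

$ scontrol show config | grep -Ei 'proctracktype|taskplugin'
# confirms proctrack/cgroup and task/cgroup are actually in effect
$ journalctl -u slurmd | grep -i cgroup
# shows whether the cgroup plugin logged errors when slurmd started on the node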
Any ideas what we should check?