[slurm-users] Problem with cgroup plugin in Ubuntu22.04 and slurm 21.08.5

Cristóbal Navarro cristobal.navarro.g at gmail.com
Tue Jul 25 00:09:58 UTC 2023


Hello Angel and Community,
I am facing a similar problem on a DGX A100 running DGX OS 6 (based on
Ubuntu 22.04 LTS) with Slurm 23.02.
When I start the `slurmd` service, its status shows failed, with the output
below.
As of today, what is the best solution to this problem? I am not sure
whether disabling cgroup v1 could break anything on the DGX A100.
Any suggestions are welcome.

➜  slurm-23.02.3 systemctl status slurmd.service

× slurmd.service - Slurm node daemon
     Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Mon 2023-07-24 19:07:03 -04; 7s ago
    Process: 3680019 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 3680019 (code=exited, status=1/FAILURE)
        CPU: 40ms

jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug:  Log file re-opened
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_init
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_load
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_export_xml
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug:  CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=8:2(hw) CoresPerSocket=16:64(hw)
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 2:freezer:/
jul 24 19:07:03 nodeGPU01 slurmd[3680019]: 0::/init.scope
jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Failed with result 'exit-code'.
➜  slurm-23.02.3
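
In case it is useful, this is what I have been running to check which cgroup
hierarchies are mounted on the node (generic commands, nothing DGX-specific,
so please correct me if there is a better way to check):

# filesystem type at the unified mount point: "cgroup2fs" means pure v2,
# "tmpfs" usually indicates v1 or hybrid
stat -fc %T /sys/fs/cgroup

# list all cgroup mounts; in hybrid mode both "cgroup" and "cgroup2" show up
mount | grep -E '^cgroup2? on'

# v1 controllers with a non-zero hierarchy ID are still mounted somewhere
cat /proc/cgroups

# what slurmd sees for its own process (the "2:freezer:/" line in the log)
cat /proc/self/cgroup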



On Wed, May 3, 2023 at 6:32 PM Angel de Vicente <angel.de.vicente at iac.es>
wrote:

> Hello,
>
> Angel de Vicente <angel.de.vicente at iac.es> writes:
>
> > ,----
> > | slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are:
> > | 5:freezer:/
> > | 3:cpuacct:/
> > `----
>
> in the end I learnt that, despite Ubuntu 22.04 reporting that it was using
> only cgroup v2, it was also using v1 and creating those mount points, so
> Slurm 23.02.01 complained that it could not work with cgroups in hybrid
> mode.
>
> So, the "solution" (as far as you don't need V1 for some reason) was to
> add "cgroup_no_v1=all" to the Kernel parameters and reboot: no more V1
> mount points and Slurm was happy with that.
>
> [In case somebody is interested in the future: I needed this so that I
> could limit the resources given to users who are not using Slurm. We have
> some shared workstations with many cores, and users were oversubscribing
> the CPUs, so I installed Slurm to bring some order to the executions there.
> But these machines are not an actual cluster with a login node: the login
> node is the same as the execution node! So with cgroups I make sure that
> users connecting via ssh only get the equivalent of 3/4 of a core (enough
> to edit files, etc.) until they submit their jobs via Slurm, at which point
> they get the full allocation they requested.]
>
> Cheers,
> --
> Ángel de Vicente
>  Research Software Engineer (Supercomputing and BigData)
>  Tel.: +34 922-605-747
>  Web.: http://research.iac.es/proyecto/polmag/
>
>  GPG: 0x8BDC390B69033F52
>
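
If I go the same route, I guess the kernel-parameter change on the DGX would
look roughly like this (untested on DGX OS 6, so please correct me if the
NVIDIA stack needs anything else):

# in /etc/default/grub (standard Ubuntu path; I assume DGX OS keeps it),
# append cgroup_no_v1=all to the existing kernel command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="... cgroup_no_v1=all"
sudo update-grub
sudo reboot

# after the reboot this should print "cgroup2fs" and no v1 mounts should remain
stat -fc %T /sys/fs/cgroup

And Angel, out of curiosity: do you limit the ssh sessions with a systemd
slice drop-in? I imagine something like the following (just my guess at the
mechanism; the path and value are hypothetical):

# /etc/systemd/system/user-.slice.d/90-cpu-limit.conf   (hypothetical drop-in)
[Slice]
# 75% of one CPU, i.e. roughly the "3/4 of a core" you mention
CPUQuota=75%

# then: sudo systemctl daemon-reload   (new ssh sessions pick up the limit)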


-- 
Cristóbal A. Navarro