<div dir="ltr"><div>Hello Angel and Community,</div><div>I am facing a similar problem with a DGX A100 with DGX OS 6 (Based on Ubuntu 22.04 LTS) and Slurm 23.02.</div><div>When I execute `slurmd` service, it status shows failed with the following information below. <br></div><div>As of today, what is the best solution to this problem? I am really not sure if the DGX A100 could fail or not by disabling cgroups v1.</div><div>Any suggestions are welcome.<br></div><div><br></div><div>➜ slurm-23.02.3 systemctl status slurmd.service <br>× slurmd.service - Slurm node daemon<br> Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)<br> Active: failed (Result: exit-code) since Mon 2023-07-24 19:07:03 -04; 7s ago<br> Process: 3680019 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)<br> Main PID: 3680019 (code=exited, status=1/FAILURE)<br> CPU: 40ms<br><br>jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug: Log file re-opened<br>jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_init<br>jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_load<br>jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug2: hwloc_topology_export_xml<br>jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: debug: CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1<br>jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=8:2(hw) CoresPerSocket=16:64(hw)<br>jul 24 19:07:03 nodeGPU01 slurmd[3680019]: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 2:freezer:/<br>jul 24 19:07:03 nodeGPU01 slurmd[3680019]: 0::/init.scope<br>jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE<br>jul 24 19:07:03 nodeGPU01 systemd[1]: slurmd.service: Failed with result 'exit-code'.<br>➜ slurm-23.02.3 <br><br><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, May 3, 2023 at 6:32 PM Angel de Vicente <<a href="mailto:angel.de.vicente@iac.es">angel.de.vicente@iac.es</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello,<br>
<br>
Angel de Vicente <<a href="mailto:angel.de.vicente@iac.es" target="_blank">angel.de.vicente@iac.es</a>> writes:<br>
<br>
> ,----<br>
> | slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: <br>
> | 5:freezer:/<br>
> | 3:cpuacct:/<br>
> `----<br>
<br>
in the end I learnt that, despite Ubuntu 22.04 reporting that it was using<br>
only cgroup V2, it was also using V1 and creating those mount points,<br>
and Slurm 23.02.01 was then complaining that it could not work with<br>
cgroups in hybrid mode.<br>
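<br>
(If you want to check this on your own machine, listing the mounted cgroup<br>
filesystems shows both hierarchies while you are in hybrid mode; the exact<br>
v1 controllers listed will differ from system to system:)<br>
<br>
,----<br>
| # hybrid mode shows both "cgroup" (v1) and "cgroup2" entries here<br>
| findmnt -t cgroup,cgroup2<br>
`----<br>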
<br>
So, the "solution" (as far as you don't need V1 for some reason) was to<br>
add "cgroup_no_v1=all" to the Kernel parameters and reboot: no more V1<br>
mount points and Slurm was happy with that.<br>
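<br>
(For reference, on Ubuntu that roughly means editing the GRUB configuration<br>
and regenerating it; adjust this to your own boot setup if you don't boot<br>
via GRUB:)<br>
<br>
,----<br>
| # in /etc/default/grub, append the parameter to the options already there:<br>
| #   GRUB_CMDLINE_LINUX="... cgroup_no_v1=all"<br>
| sudo update-grub<br>
| sudo reboot<br>
| # after the reboot, the findmnt check above should only list cgroup2<br>
`----<br>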
<br>
[in case somebody is interested in the future: I needed this so that I<br>
could limit the resources given to users not going through Slurm. We have<br>
some shared workstations with many cores, and users were oversubscribing<br>
the CPUs, so I installed Slurm to bring some order to the jobs running<br>
there. But these machines are not an actual cluster with a login node:<br>
the login node is the same as the executing node! So with cgroups I make<br>
sure that users connecting via ssh get only the equivalent of 3/4 of a<br>
core (enough to edit files, etc.) until they submit their jobs via Slurm,<br>
at which point they get the full allocation they requested].<br>
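<br>
(As a rough sketch, and not necessarily exactly what we run here, a systemd<br>
drop-in for the per-user slices is one way to impose such a limit on recent<br>
systemd versions; slurmd runs as a system service, so its job steps land<br>
outside the user-UID.slice units and keep their full Slurm allocation:)<br>
<br>
,----<br>
| # /etc/systemd/system/user-.slice.d/90-cpu-limit.conf   (example path)<br>
| # limit every interactive user session to ~3/4 of one core<br>
| [Slice]<br>
| CPUQuota=75%<br>
`----<br>
<br>
followed by "systemctl daemon-reload" (and a re-login for existing sessions).<br>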
<br>
Cheers,<br>
-- <br>
Ángel de Vicente<br>
Research Software Engineer (Supercomputing and BigData)<br>
Tel.: +34 922-605-747<br>
Web.: <a href="http://research.iac.es/proyecto/polmag/" rel="noreferrer" target="_blank">http://research.iac.es/proyecto/polmag/</a><br>
<br>
GPG: 0x8BDC390B69033F52<br>
</blockquote></div><br clear="all"><br><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div>Cristóbal A. Navarro</div></div></div>