[slurm-users] Problem with cgroup plugin in Ubuntu22.04 and slurm 21.08.5
ralf.utermann at physik.uni-augsburg.de
Thu Jul 27 12:08:50 UTC 2023
On 26.07.23 at 11:38, Ralf Utermann wrote:
> On 25.07.23 at 02:09, Cristóbal Navarro wrote:
>> Hello Angel and Community,
>> I am facing a similar problem with a DGX A100 with DGX OS 6 (Based on Ubuntu 22.04 LTS) and Slurm 23.02.
>> When I start the `slurmd` service, its status shows failed with the information below.
> Hello Cristobal,
> we see similar problems not on DGX but standard server nodes running
> Ubuntu 22.04 (kernel 5.15.0-76-generic) and Slurm 23.02.3.
> The first start of the slurmd service always fails, with lots of errors
> in the slurmd.log like:
> error: cpu cgroup controller is not available.
> error: There's an issue initializing memory or cpu controller
> After 90 seconds the slurmd service start times out and is marked failed.
> BUT: One process is still running:
> /usr/local/slurm/23.02.3/sbin/slurmstepd infinity
> This looks like the process started to handle cgroup v2, as described
> in the Slurm cgroup v2 documentation. When we keep this slurmstepd
> infinity process running and just start the slurmd service a second
> time, everything comes up fine.
> So our current workaround is to configure the slurmd service
> with Restart=on-failure in its [Service] section.
> Are there real solutions to this initial timeout failure?
> best regards, Ralf
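For anyone wanting to replicate the workaround Ralf describes, it can be applied as a systemd drop-in without touching the packaged unit file (a sketch; the drop-in path and the RestartSec value are my additions, not from the original mail):

```ini
# /etc/systemd/system/slurmd.service.d/restart.conf
# Retry slurmd after the initial cgroup-related failure; the second
# start succeeds once the "slurmstepd infinity" process is in place.
[Service]
Restart=on-failure
RestartSec=5
```

Run `systemctl daemon-reload` afterwards so systemd picks up the drop-in.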
>> As of today, what is the best solution to this problem? I am not sure whether disabling cgroups v1 could break anything on the DGX A100.
>> Any suggestions are welcome.
>> ➜ slurm-23.02.3 systemctl status slurmd.service
>> × slurmd.service - Slurm node daemon
>> Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
>> Active: failed (Result: exit-code) since Mon 2023-07-24 19:07:03 -04; 7s ago
>> Process: 3680019 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
>> Main PID: 3680019 (code=exited, status=1/FAILURE)
>> CPU: 40ms
>> jul 24 19:07:03 nodeGPU01 slurmd: slurmd: debug: Log file re-opened
>> jul 24 19:07:03 nodeGPU01 slurmd: slurmd: debug2: hwloc_topology_init
>> jul 24 19:07:03 nodeGPU01 slurmd: slurmd: debug2: hwloc_topology_load
>> jul 24 19:07:03 nodeGPU01 slurmd: slurmd: debug2: hwloc_topology_export_xml
>> jul 24 19:07:03 nodeGPU01 slurmd: slurmd: debug: CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
>> jul 24 19:07:03 nodeGPU01 slurmd: slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=8:2(hw) CoresPerSocket=16:64(hw)
>> jul 24 19:07:03 nodeGPU01 slurmd: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 2:freezer:/
>> jul 24 19:07:03 nodeGPU01 slurmd: 0::/init.scope
>> jul 24 19:07:03 nodeGPU01 systemd: slurmd.service: Main process exited, code=exited, status=1/FAILURE
>> jul 24 19:07:03 nodeGPU01 systemd: slurmd.service: Failed with result 'exit-code'.
>> ➜ slurm-23.02.3
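The "Hybrid mode is not supported" fatal error in the log above means both cgroup v1 controllers and the v2 unified hierarchy are mounted at once. A quick way to check which situation a node is in (standard Linux interfaces, not from the original mail):

```shell
# List mounted cgroup filesystems. A pure v2 setup shows a single
# "cgroup2" line; hybrid mode also shows v1 controller mounts such as
# "cgroup ... freezer" (matching the "2:freezer:/" in the fatal error).
grep cgroup /proc/mounts

# On a v2 hierarchy, the controllers enabled at the root are listed here
# (the file does not exist on a pure v1 setup, hence the fallback):
cat /sys/fs/cgroup/cgroup.controllers 2>/dev/null || true
```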
>> On Wed, May 3, 2023 at 6:32 PM Angel de Vicente <angel.de.vicente at iac.es> wrote:
>> Angel de Vicente <angel.de.vicente at iac.es> writes:
>> > ,----
>> > | slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are:
>> > | 5:freezer:/
>> > | 3:cpuacct:/
>> > `----
>> in the end I learned that although Ubuntu 22.04 reported using
>> only cgroup v2, it was also using v1 and creating those mount points,
>> and Slurm 23.02.01 then complained that it could not work with
>> cgroups in hybrid mode.
>> So the "solution" (as long as you don't need v1 for some reason) was to
>> add "cgroup_no_v1=all" to the kernel parameters and reboot: no more v1
>> mount points, and Slurm was happy with that.
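For reference, on a stock Ubuntu 22.04 install that kernel parameter is typically added via GRUB (the paths below are distribution defaults, not something stated in the original mail):

```
# /etc/default/grub -- append the parameter to the existing value, e.g.:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash cgroup_no_v1=all"

# then regenerate the boot config and reboot:
#   sudo update-grub
#   sudo reboot

# after the reboot, only the unified v2 hierarchy should remain mounted:
#   grep cgroup /proc/mounts
```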
>> [In case somebody is interested in the future: I needed this so that I
>> could limit the resources given to users not going through Slurm. We
>> have some shared workstations with many cores, and users were
>> oversubscribing the CPUs, so I installed Slurm to impose some order on
>> job execution there. But these machines are not an actual cluster with
>> a login node: the login node is the same as the executing node! So with
>> cgroups I ensure that users connecting via ssh only get resources
>> equivalent to 3/4 of a core (enough to edit files, etc.) until they
>> submit their jobs via Slurm, at which point they get the full
>> allocation they requested.]
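One common way to express such a per-login-session limit with cgroup v2 is a systemd slice drop-in; this is a hedged sketch, since the mail does not say exactly how Ángel configured it, and the 75% quota here simply mirrors the "3/4 of a core" he mentions:

```ini
# /etc/systemd/system/user-.slice.d/50-cpu.conf
# Prefix drop-in: applies to every user-UID.slice, i.e. all ssh/login
# sessions. CPUQuota=75% caps each user's sessions at 0.75 of one CPU.
[Slice]
CPUQuota=75%
```

Jobs launched through Slurm run under slurmd's own cgroup hierarchy rather than the user's login slice, so they are not subject to this quota, which matches the behavior described above.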
>> -- Ángel de Vicente
>> Research Software Engineer (Supercomputing and BigData)
>> Tel.: +34 922-605-747
>> Web.: http://research.iac.es/proyecto/polmag/
>> GPG: 0x8BDC390B69033F52
>> Cristóbal A. Navarro