[slurm-users] Problem with cgroup plugin in Ubuntu22.04 and slurm 21.08.5
ralf.utermann at physik.uni-augsburg.de
Thu Jul 27 12:08:50 UTC 2023
On 26.07.23 at 11:38, Ralf Utermann wrote:
> On 25.07.23 at 02:09, Cristóbal Navarro wrote:
>> Hello Angel and Community,
>> I am facing a similar problem with a DGX A100 with DGX OS 6 (Based on Ubuntu 22.04 LTS) and Slurm 23.02.
>> When I start the `slurmd` service, its status shows failed with the information below.
> Hello Cristobal,
> we see similar problems not on DGX but standard server nodes running
> Ubuntu 22.04 (kernel 5.15.0-76-generic) and Slurm 23.02.3.
> The first start of the slurmd service always fails, with lots of errors
> in the slurmd.log like:
> error: cpu cgroup controller is not available.
> error: There's an issue initializing memory or cpu controller
> After 90 seconds the slurmd service start times out and is marked failed.
> BUT: One process is still running:
> /usr/local/slurm/23.02.3/sbin/slurmstepd infinity
> This looks like the process started to handle cgroup v2, as described
> in the Slurm cgroup v2 documentation. When we keep this slurmstepd
> infinity process running and just start the slurmd service a second
> time, everything comes up fine.
> So our current workaround is to configure the slurmd service
> with Restart=on-failure in its [Service] section.
> Are there real solutions to this initial timeout failure?
> best regards, Ralf
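For anyone wanting to replicate the workaround Ralf describes, it can be applied as a systemd drop-in without touching the packaged unit file (a sketch; the drop-in path and the RestartSec value are my additions, not from the original mail):

```ini
# /etc/systemd/system/slurmd.service.d/restart.conf
# Retry slurmd after the initial cgroup-related failure; the second
# start succeeds once the "slurmstepd infinity" process is in place.
[Service]
Restart=on-failure
RestartSec=5
```

Run `systemctl daemon-reload` afterwards so systemd picks up the drop-in.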
>> As of today, what is the best solution to this problem? I am not sure whether disabling cgroups v1 could break anything on the DGX A100.
>> Any suggestions are welcome.
>> ➜ slurm-23.02.3 systemctl status slurmd.service
>> × slurmd.service - Slurm node daemon
>> Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
>> Active: failed (Result: exit-code) since Mon 2023-07-24 19:07:03 -04; 7s ago
>> Process: 3680019 ExecStart=/usr/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
>> Main PID: 3680019 (code=exited, status=1/FAILURE)
>> CPU: 40ms
>> jul 24 19:07:03 nodeGPU01 slurmd: slurmd: debug: Log file re-opened
>> jul 24 19:07:03 nodeGPU01 slurmd: slurmd: debug2: hwloc_topology_init
>> jul 24 19:07:03 nodeGPU01 slurmd: slurmd: debug2: hwloc_topology_load
>> jul 24 19:07:03 nodeGPU01 slurmd: slurmd: debug2: hwloc_topology_export_xml
>> jul 24 19:07:03 nodeGPU01 slurmd: slurmd: debug: CPUs:128 Boards:1 Sockets:2 CoresPerSocket:64 ThreadsPerCore:1
>> jul 24 19:07:03 nodeGPU01 slurmd: slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=8:2(hw) CoresPerSocket=16:64(hw)
>> jul 24 19:07:03 nodeGPU01 slurmd: slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are: 2:freezer:/
>> jul 24 19:07:03 nodeGPU01 slurmd: 0::/init.scope
>> jul 24 19:07:03 nodeGPU01 systemd: slurmd.service: Main process exited, code=exited, status=1/FAILURE
>> jul 24 19:07:03 nodeGPU01 systemd: slurmd.service: Failed with result 'exit-code'.
>> ➜ slurm-23.02.3
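The "Hybrid mode is not supported" fatal error in the log above means both cgroup v1 controllers and the v2 unified hierarchy are mounted at once. A quick way to check which situation a node is in (standard Linux interfaces, not from the original mail):

```shell
# List mounted cgroup filesystems. A pure v2 setup shows a single
# "cgroup2" line; hybrid mode also shows v1 controller mounts such as
# "cgroup ... freezer" (matching the "2:freezer:/" in the fatal error).
grep cgroup /proc/mounts

# On a v2 hierarchy, the controllers enabled at the root are listed here
# (the file does not exist on a pure v1 setup, hence the fallback):
cat /sys/fs/cgroup/cgroup.controllers 2>/dev/null || true
```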
>> On Wed, May 3, 2023 at 6:32 PM Angel de Vicente <angel.de.vicente at iac.es> wrote:
>> Angel de Vicente <angel.de.vicente at iac.es> writes:
>> > ,----
>> > | slurmd: fatal: Hybrid mode is not supported. Mounted cgroups are:
>> > | 5:freezer:/
>> > | 3:cpuacct:/
>> > `----
>> in the end I learned that although Ubuntu 22.04 reported using
>> only cgroup v2, it was also using v1 and creating those mount points,
>> and Slurm 23.02.01 then complained that it could not work with
>> cgroups in hybrid mode.
>> So the "solution" (as long as you don't need v1 for some reason) was to
>> add "cgroup_no_v1=all" to the kernel parameters and reboot: no more v1
>> mount points, and Slurm was happy with that.
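For reference, on a stock Ubuntu 22.04 install that kernel parameter is typically added via GRUB (the paths below are distribution defaults, not something stated in the original mail):

```
# /etc/default/grub -- append the parameter to the existing value, e.g.:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash cgroup_no_v1=all"

# then regenerate the boot config and reboot:
#   sudo update-grub
#   sudo reboot

# after the reboot, only the unified v2 hierarchy should remain mounted:
#   grep cgroup /proc/mounts
```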
>> [In case somebody is interested in the future: I needed this so that I
>> could limit the resources given to users not going through Slurm. We
>> have some shared workstations with many cores, and users were
>> oversubscribing the CPUs, so I installed Slurm to impose some order on
>> job execution there. But these machines are not an actual cluster with
>> a login node: the login node is the same as the executing node! So with
>> cgroups I ensure that users connecting via ssh only get resources
>> equivalent to 3/4 of a core (enough to edit files, etc.) until they
>> submit their jobs via Slurm, at which point they get the full
>> allocation they requested.]
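One common way to express such a per-login-session limit with cgroup v2 is a systemd slice drop-in; this is a hedged sketch, since the mail does not say exactly how Ángel configured it, and the 75% quota here simply mirrors the "3/4 of a core" he mentions:

```ini
# /etc/systemd/system/user-.slice.d/50-cpu.conf
# Prefix drop-in: applies to every user-UID.slice, i.e. all ssh/login
# sessions. CPUQuota=75% caps each user's sessions at 0.75 of one CPU.
[Slice]
CPUQuota=75%
```

Jobs launched through Slurm run under slurmd's own cgroup hierarchy rather than the user's login slice, so they are not subject to this quota, which matches the behavior described above.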
>> -- Ángel de Vicente
>> Research Software Engineer (Supercomputing and BigData)
>> Tel.: +34 922-605-747
>> Web.: http://research.iac.es/proyecto/polmag/
>> GPG: 0x8BDC390B69033F52
>> Cristóbal A. Navarro