[slurm-users] Problems with cgroupsv2

Timony, Mick Michael_Timony at hms.harvard.edu
Tue Aug 16 20:43:31 UTC 2022


When I see odd behaviour like this, I've found it is sometimes caused by either NTP issues (the clock is off) or munge errors:

  *   Is NTP running, and is the time accurate?
  *   Look for munge errors:
     *   /var/log/munge/munged.log
     *   sudo systemctl status munge

If it's a munge error, usually restarting munge does the trick:

sudo systemctl restart munge
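
A minimal sketch of the checks above as a script, in case it helps to run them on each affected node. It assumes a systemd host where `timedatectl` is available and the munge client tools are installed; the helper name `run_if_present` is made up for this example, and nothing here is Slurm-specific.

```shell
#!/bin/sh
# Hypothetical helper: run a command only if it is installed,
# otherwise report that the check was skipped.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skip: $1 not installed"
    fi
}

# Time sync: prints "yes" when systemd considers the clock synchronized.
run_if_present timedatectl show -p NTPSynchronized --value \
    || echo "timedatectl failed"

# Munge round trip: encode a credential locally and decode it again.
if command -v munge >/dev/null 2>&1 && command -v unmunge >/dev/null 2>&1; then
    if munge -n | unmunge >/dev/null 2>&1; then
        echo "munge: local round trip OK"
    else
        echo "munge: round trip failed, check /var/log/munge/munged.log"
    fi
else
    echo "skip: munge not installed"
fi
```

A local round trip only shows that munged is running on that node. To check that two nodes share the same key and have close enough clocks, the credential can be decoded on the other side instead, e.g. `munge -n | ssh <other-node> unmunge`.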

Regards
--Mick
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Alan Orth <alan.orth at gmail.com>
Sent: Tuesday, August 16, 2022 4:36 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Problems with cgroupsv2

I re-installed SLURM 22.05.3 and then restarted slurmd and now it's working:

# dnf reinstall slurm slurm-slurmd slurm-devel slurm-pam_slurm
# systemctl restart slurmd

The dnf.log shows that the versions were the same, so there was no mismatch or anything:

2022-08-16T23:29:02+0300 DEBUG Reinstalled: slurm-22.05.3-1.el8.x86_64
2022-08-16T23:29:02+0300 DEBUG Reinstalled: slurm-devel-22.05.3-1.el8.x86_64
2022-08-16T23:29:02+0300 DEBUG Reinstalled: slurm-pam_slurm-22.05.3-1.el8.x86_64
2022-08-16T23:29:02+0300 DEBUG Reinstalled: slurm-slurmd-22.05.3-1.el8.x86_64

So I'm not sure what's going on... anyways, at least it's working now!

Regards,

On Tue, Aug 16, 2022 at 12:53 PM Alan Orth <alan.orth at gmail.com> wrote:
Dear list,

I've been using cgroupsv2 with SLURM 22.05 on CentOS Stream 8 successfully for a few months now. Recently a few of my nodes have started having problems starting slurmd. The log shows:

[2022-08-16T20:52:58.439] slurmd version 22.05.3 started
[2022-08-16T20:52:58.439] error: Controller cpuset is not enabled!
[2022-08-16T20:52:58.439] error: Controller cpu is not enabled!
[2022-08-16T20:52:58.439] error: cpu cgroup controller is not available.
[2022-08-16T20:52:58.439] error: There's an issue initializing memory or cpu controller
[2022-08-16T20:52:58.439] error: Couldn't load specified plugin name for jobacct_gather/cgroup: Plugin init() callback failed
[2022-08-16T20:52:58.439] error: cannot create jobacct_gather context for jobacct_gather/cgroup
[2022-08-16T20:52:58.439] fatal: Unable to initialize jobacct_gather

The system has cgroupsv2 enabled as far as I can tell:

# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma
# [ $(stat -fc %T /sys/fs/cgroup/) = "cgroup2fs" ] && echo "unified" || ( [ -e /sys/fs/cgroup/unified/ ] && echo "hybrid" || echo "legacy")
unified
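
One more thing that may be worth checking here: the root `cgroup.controllers` file only shows what the kernel offers. In cgroup v2 a child cgroup only gets the controllers that its parent lists in `cgroup.subtree_control`, so a controller can be present at the root and still be missing further down. The sketch below compares the controllers named in the slurmd errors against one cgroup directory. It assumes slurmd runs as a systemd service under `system.slice`; that path, the `CG_DIR` variable and the function name `missing_controllers` are assumptions made for this example, not something taken from the Slurm documentation.

```shell
#!/bin/sh
# Print every controller in $1 (wanted) that is absent from $2 (available).
# Both arguments are space-separated lists, like the content of cgroup.controllers.
missing_controllers() {
    wanted=$1
    available=$2
    missing=""
    for c in $wanted; do
        case " $available " in
            *" $c "*) ;;                              # controller is available
            *) missing="${missing:+$missing }$c" ;;   # remember the missing one
        esac
    done
    printf '%s\n' "$missing"
}

# Assumed location of the cgroup that slurmd is started in; override with CG_DIR.
cg_dir=${CG_DIR:-/sys/fs/cgroup/system.slice}

if [ -r "$cg_dir/cgroup.controllers" ]; then
    available=$(cat "$cg_dir/cgroup.controllers")
    echo "available in $cg_dir: $available"
    echo "missing: $(missing_controllers "cpuset cpu memory" "$available")"
else
    echo "cannot read $cg_dir/cgroup.controllers"
fi
```

If a controller shows up as missing there but is listed at the root, the parent's `cgroup.subtree_control` is the next file to look at.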

And my slurm.conf has:

ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup

And cgroup.conf:

CgroupAutomount=yes
CgroupPlugin=autodetect

What else should I look for before giving up and reverting to cgroupsv1? My current version is 22.05.3, but it was happening in 22.05.2 as well.

Thank you for any advice.
--
Alan Orth
alan.orth at gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch

