[slurm-users] Problems with cgroupsv2

Alan Orth alan.orth at gmail.com
Tue Aug 16 21:36:49 UTC 2022


Thanks for the advice. I checked munge's log on the system that was most
recently affected and found a few hundred of these:

2022-08-16 23:30:56 +0300 Info:      Unauthorized credential for client UID=0 GID=0

Not sure if relevant. NTP on the system is synced. I'll keep an eye on
munge in the future...
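
To see whether those entries cluster around the time slurmd failed,
something like the following could tally them by hour. This is only a
minimal sketch: the function name count_unauthorized_by_hour is made up
for illustration, and the default path is the munged.log location Mick
mentioned, which may differ on other systems.

```shell
#!/bin/sh
# Count "Unauthorized credential" entries per hour in a munged log.
# The default path is an assumption taken from this thread; pass another
# file as the first argument if the log lives elsewhere.
count_unauthorized_by_hour() {
    log="${1:-/var/log/munge/munged.log}"
    # Fields: $1 = date, $2 = time (HH:MM:SS); keep the date and the hour.
    grep -F 'Unauthorized credential' "$log" |
        awk '{ print $1, substr($2, 1, 2) ":00" }' |
        sort | uniq -c
}
```

Running count_unauthorized_by_hour with no argument reads the default
log; each output line is a count followed by the date and hour.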

Thanks again,

On Tue, Aug 16, 2022 at 1:45 PM Timony, Mick <Michael_Timony at hms.harvard.edu>
wrote:

> When I see odd behaviour I've found it sometimes related to either NTP
> issues (the time is off) or munge errors:
>
>    - Is NTP running and is the time accurate
>    - Look for munge errors:
>       - /var/log/munge/munged.log
>       - sudo systemctl status munge
>
> If it's a munge error, usually restarting munge does the trick:
>
> sudo systemctl restart munge
>
> Regards
> --Mick
> ------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Alan Orth <alan.orth at gmail.com>
> *Sent:* Tuesday, August 16, 2022 4:36 PM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] Problems with cgroupsv2
>
> I re-installed SLURM 22.05.3 and then restarted slurmd and now it's
> working:
>
> # dnf reinstall slurm slurm-slurmd slurm-devel slurm-pam_slurm
> # systemctl restart slurmd
>
> The dnf.log shows that the versions were the same, so there was no
> mismatch or anything:
>
> 2022-08-16T23:29:02+0300 DEBUG Reinstalled: slurm-22.05.3-1.el8.x86_64
> 2022-08-16T23:29:02+0300 DEBUG Reinstalled: slurm-devel-22.05.3-1.el8.x86_64
> 2022-08-16T23:29:02+0300 DEBUG Reinstalled: slurm-pam_slurm-22.05.3-1.el8.x86_64
> 2022-08-16T23:29:02+0300 DEBUG Reinstalled: slurm-slurmd-22.05.3-1.el8.x86_64
>
> So I'm not sure what's going on... anyways, at least it's working now!
>
> Regards,
>
> On Tue, Aug 16, 2022 at 12:53 PM Alan Orth <alan.orth at gmail.com> wrote:
>
> Dear list,
>
> I've been using cgroupsv2 with SLURM 22.05 on CentOS Stream 8 successfully
> for a few months now. Recently a few of my nodes have started having
> problems starting slurmd. The log shows:
>
> [2022-08-16T20:52:58.439] slurmd version 22.05.3 started
> [2022-08-16T20:52:58.439] error: Controller cpuset is not enabled!
> [2022-08-16T20:52:58.439] error: Controller cpu is not enabled!
> [2022-08-16T20:52:58.439] error: cpu cgroup controller is not available.
> [2022-08-16T20:52:58.439] error: There's an issue initializing memory or cpu controller
> [2022-08-16T20:52:58.439] error: Couldn't load specified plugin name for jobacct_gather/cgroup: Plugin init() callback failed
> [2022-08-16T20:52:58.439] error: cannot create jobacct_gather context for jobacct_gather/cgroup
> [2022-08-16T20:52:58.439] fatal: Unable to initialize jobacct_gather
>
> The system has cgroupsv2 enabled as far as I can tell:
>
> # cat /sys/fs/cgroup/cgroup.controllers
> cpuset cpu io memory hugetlb pids rdma
> # [ $(stat -fc %T /sys/fs/cgroup/) = "cgroup2fs" ] && echo "unified" || ( [ -e /sys/fs/cgroup/unified/ ] && echo "hybrid" || echo "legacy")
> unified
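
One more thing that may be worth checking: the root cgroup.controllers
file lists what the kernel makes available, while the
cgroup.subtree_control file at each level lists which controllers are
actually enabled for the children of that cgroup. A controller can
therefore show up in the output above and still be missing further down
the tree. Whether that is what happened on these nodes is not
established in this thread. The helper below is a hypothetical sketch
(the name missing_controllers is invented) that prints any required
controller not found in a given file.

```shell
#!/bin/sh
# Print each required controller that is NOT listed in a cgroup v2
# controller file (cgroup.controllers or cgroup.subtree_control).
# Usage: missing_controllers FILE CONTROLLER...
missing_controllers() {
    file="$1"; shift
    # Pad with spaces so each name can be matched as a whole word.
    enabled=" $(cat "$file") "
    for c in "$@"; do
        case "$enabled" in
            *" $c "*) ;;                 # controller is listed
            *) printf '%s\n' "$c" ;;     # controller is missing
        esac
    done
}
```

For example, "missing_controllers /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu memory" prints nothing when all three are enabled at the
root. Repeating the check on the parent cgroup of the slurmd service
(often /sys/fs/cgroup/system.slice under systemd, but that path is an
assumption) shows whether they are passed down.
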
>
> And my slurm.conf has:
>
> ProctrackType=proctrack/cgroup
> TaskPlugin=task/affinity,task/cgroup
>
> And cgroup.conf:
>
> CgroupAutomount=yes
> CgroupPlugin=autodetect
>
> What else should I look for before giving up and reverting to cgroupsv1?
> My current version is 22.05.3, but it was happening in 22.05.2 as well.
>
> Thank you for any advice.
> --
> Alan Orth
> alan.orth at gmail.com
> https://picturingjordan.com
> https://englishbulgaria.net
> https://mjanja.ch
>
>


-- 
Alan Orth
alan.orth at gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch