<div dir="ltr"><div>Thanks for the advice. I checked munge's log on the system that was most recently affected and found a few hundred of these:</div><div><br></div><div>2022-08-16 23:30:56 +0300 Info: Unauthorized credential for client UID=0 GID=0</div><div><br></div><div>Not sure if relevant. NTP on the system is synced. I'll keep an eye on munge in the future...</div><div><br></div><div>Thanks again,<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Aug 16, 2022 at 1:45 PM Timony, Mick &lt;<a href="mailto:Michael_Timony@hms.harvard.edu">Michael_Timony@hms.harvard.edu</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div style="font-family:Arial,Helvetica,sans-serif;font-size:10pt;color:rgb(0,0,0);background-color:rgb(255,255,255)">
When I see odd behaviour, I've sometimes found it to be related to either NTP issues (the time is off) or munge errors:</div>
<div style="font-family:Arial,Helvetica,sans-serif;font-size:10pt;color:rgb(0,0,0);background-color:rgb(255,255,255)">
<ul>
<li><span>Is NTP running and is the time accurate?</span></li><li><span>Look for munge errors:</span></li><ul style="list-style-type:circle">
<li>/var/log/munge/munged.log</li><li>sudo systemctl status munge<br>
</li></ul>
</ul>
<div>If it's a munge error, usually restarting munge does the trick:<br>
<br>
</div>
<div>sudo systemctl restart munge<br>
</div>
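<div><br>
</div>
<div>Before restarting, it can help to see how often a particular message shows up in the munge log. A minimal sketch, assuming a POSIX shell and the log path listed above; the function name count_munge_messages is hypothetical, and the message to search for is whatever you are chasing (for instance the "Unauthorized credential" line reported elsewhere in this thread):</div>
<pre>
```shell
# Count the lines in a log file that contain a fixed string.
# Usage: count_munge_messages LOGFILE MESSAGE
# (hypothetical helper, not part of munge)
count_munge_messages() {
    log=$1
    message=$2
    # -F treats MESSAGE as a literal string; -c prints the number of matching lines
    grep -F -c -- "$message" "$log"
}
```
</pre>
<div>For example, count_munge_messages /var/log/munge/munged.log "Unauthorized credential" prints the number of matching lines. A quick round-trip test of munge itself is "munge -n | unmunge".</div>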
<div><br>
</div>
<div>Regards</div>
<div>--Mick</div>
</div>
<hr style="display:inline-block;width:98%">
<div id="gmail-m_-1937668943907577257divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> slurm-users &lt;<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>&gt; on behalf of Alan Orth &lt;<a href="mailto:alan.orth@gmail.com" target="_blank">alan.orth@gmail.com</a>&gt;<br>
<b>Sent:</b> Tuesday, August 16, 2022 4:36 PM<br>
<b>To:</b> Slurm User Community List &lt;<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>&gt;<br>
<b>Subject:</b> Re: [slurm-users] Problems with cgroupsv2</font>
<div> </div>
</div>
<div>
<div dir="ltr">
<div>I reinstalled SLURM 22.05.3, restarted slurmd, and now it's working:</div>
<div><br>
</div>
<div># dnf reinstall slurm slurm-slurmd slurm-devel slurm-pam_slurm <br>
</div>
<div># systemctl restart slurmd</div>
<div><br>
</div>
<div>The dnf.log shows that the versions were the same, so there was no mismatch or anything:<br>
</div>
<div><br>
</div>
<div>2022-08-16T23:29:02+0300 DEBUG Reinstalled: slurm-22.05.3-1.el8.x86_64<br>
2022-08-16T23:29:02+0300 DEBUG Reinstalled: slurm-devel-22.05.3-1.el8.x86_64<br>
2022-08-16T23:29:02+0300 DEBUG Reinstalled: slurm-pam_slurm-22.05.3-1.el8.x86_64<br>
2022-08-16T23:29:02+0300 DEBUG Reinstalled: slurm-slurmd-22.05.3-1.el8.x86_64</div>
<div><br>
</div>
<div>So I'm not sure what's going on... anyways, at least it's working now!</div>
<div><br>
</div>
<div>Regards,<br>
</div>
</div>
<br>
<div>
<div dir="ltr">On Tue, Aug 16, 2022 at 12:53 PM Alan Orth &lt;<a href="mailto:alan.orth@gmail.com" target="_blank">alan.orth@gmail.com</a>&gt; wrote:<br>
</div>
<blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div>Dear list,</div>
<div><br>
</div>
<div>I've been using cgroupsv2 with SLURM 22.05 on CentOS Stream 8 successfully for a few months now. Recently a few of my nodes have started having problems starting slurmd. The log shows:</div>
<div><br>
</div>
<div>[2022-08-16T20:52:58.439] slurmd version 22.05.3 started<br>
[2022-08-16T20:52:58.439] error: Controller cpuset is not enabled!<br>
[2022-08-16T20:52:58.439] error: Controller cpu is not enabled!<br>
[2022-08-16T20:52:58.439] error: cpu cgroup controller is not available.<br>
[2022-08-16T20:52:58.439] error: There's an issue initializing memory or cpu controller<br>
[2022-08-16T20:52:58.439] error: Couldn't load specified plugin name for jobacct_gather/cgroup: Plugin init() callback failed<br>
[2022-08-16T20:52:58.439] error: cannot create jobacct_gather context for jobacct_gather/cgroup<br>
[2022-08-16T20:52:58.439] fatal: Unable to initialize jobacct_gather</div>
<div><br>
</div>
<div>The system has cgroupsv2 enabled as far as I can tell:</div>
<div><br>
</div>
<div># cat /sys/fs/cgroup/cgroup.controllers<br>
cpuset cpu io memory hugetlb pids rdma<br>
# [ "$(stat -fc %T /sys/fs/cgroup/)" = "cgroup2fs" ] &amp;&amp; echo "unified" || ( [ -e /sys/fs/cgroup/unified/ ] &amp;&amp; echo "hybrid" || echo "legacy" )<br>
unified</div>
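<div><br>
</div>
<div>The check above only covers the root of the hierarchy. In cgroup v2 a controller can be listed at the root and still be unavailable further down if a parent cgroup does not pass it on through its cgroup.subtree_control file. A minimal sketch for checking a given level, assuming a POSIX shell; the function name missing_controllers is hypothetical, and the path in the usage line assumes slurmd runs under systemd's system.slice, which may differ on other setups:</div>
<pre>
```shell
# Print each required controller that is NOT listed in a cgroup v2
# controller file (cgroup.controllers or cgroup.subtree_control).
# Usage: missing_controllers FILE CONTROLLER...
# (hypothetical helper, not part of Slurm)
missing_controllers() {
    file=$1
    shift
    # Pad with spaces so that "cpu" does not match inside "cpuset"
    enabled=" $(cat "$file") "
    for ctrl in "$@"; do
        case "$enabled" in
            *" $ctrl "*) ;;                 # listed, nothing to report
            *) printf '%s\n' "$ctrl" ;;     # not listed at this level
        esac
    done
}
```
</pre>
<div>For example, "missing_controllers /sys/fs/cgroup/system.slice/cgroup.controllers cpuset cpu memory" prints nothing when all three controllers are available at that level.</div>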
<div><br>
</div>
<div>And my slurm.conf has:</div>
<div><br>
</div>
<div>ProctrackType=proctrack/cgroup</div>
<div>TaskPlugin=task/affinity,task/cgroup</div>
<div><br>
</div>
<div>And cgroup.conf:</div>
<div><br>
</div>
<div>CgroupAutomount=yes<br>
CgroupPlugin=autodetect</div>
<div><br>
</div>
<div>What else should I look for before giving up and reverting to cgroupsv1? My current version is 22.05.3, but it was happening in 22.05.2 as well.<br>
</div>
<div><br>
</div>
<div>Thank you for any advice.<br>
</div>
<div>-- <br>
<div dir="ltr">
<div dir="ltr">
<div>Alan Orth<br>
<a href="mailto:alan.orth@gmail.com" target="_blank">alan.orth@gmail.com</a><br>
<a href="https://picturingjordan.com" target="_blank">https://picturingjordan.com</a><br>
<a href="https://englishbulgaria.net" target="_blank">https://englishbulgaria.net</a><br>
<a href="https://mjanja.ch" target="_blank">https://mjanja.ch</a></div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div>Alan Orth<br><a href="mailto:alan.orth@gmail.com" target="_blank">alan.orth@gmail.com</a><br><a href="https://picturingjordan.com" target="_blank">https://picturingjordan.com</a><br><a href="https://englishbulgaria.net" target="_blank">https://englishbulgaria.net</a><br><a href="https://mjanja.ch" target="_blank">https://mjanja.ch</a></div></div></div>