<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Offhand, I would suggest double-checking munge and the versions of
slurmd/slurmctld.</p>
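<p>A minimal sketch of such a check, assuming passwordless ssh to the
two controllers named in the thread (slurm1 and slurm2). The function
names are hypothetical helpers, not part of Slurm or MUNGE; only
<code>slurmctld -V</code>, <code>munge -n</code> and
<code>unmunge</code> are real commands.</p>
<pre>
#!/bin/sh
# Sketch only: host names come from the thread; ssh access is an assumption.

# Print the major.minor release from a version line such as "slurm 21.08.8".
release_of() {
    echo "$1" | awk '{ split($2, v, "."); print v[1] "." v[2] }'
}

# Succeed only if two version lines name the same major.minor release.
same_release() {
    [ "$(release_of "$1")" = "$(release_of "$2")" ]
}

# Compare the slurmctld version on both controllers (hypothetical helper).
check_controllers() {
    v1=$(ssh slurm1 slurmctld -V)
    v2=$(ssh slurm2 slurmctld -V)
    echo "slurm1: $v1"
    echo "slurm2: $v2"
    if same_release "$v1" "$v2"; then
        echo "controller releases match"
    else
        echo "controller releases differ"
    fi
}

# Confirm that a munge credential created on slurm1 decodes on slurm2.
check_munge() {
    ssh slurm1 munge -n | ssh slurm2 unmunge
}
</pre>
<p>After sourcing the file, run <code>check_controllers</code> and
<code>check_munge</code> from a host that can reach both controllers.
The same comparison can be repeated with <code>slurmd -V</code> on a
compute node. A mismatch in either check would be worth resolving
before digging further.</p>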
<p>Brian Andrus<br>
</p>
<div class="moz-cite-prefix">On 6/3/2022 3:17 AM,
<a class="moz-txt-link-abbreviated" href="mailto:taleintervenor@sjtu.edu.cn">taleintervenor@sjtu.edu.cn</a> wrote:<br>
</div>
<blockquote type="cite"
cite="mid:089001d87733$17787f80$46697e80$@sjtu.edu.cn">
<div class="WordSection1">
<p class="MsoNormal"><span lang="EN-US">Hi, all:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">Our cluster has two
Slurm control nodes, and scontrol show config reports the following:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">> scontrol show
config<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">…<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">SlurmctldHost[0]
= slurm1<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">SlurmctldHost[1]
= slurm2<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">StateSaveLocation
= /etc/slurm/state<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">…<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">Of course, we have made
sure both nodes have the same Slurm configuration, mount the same
NFS share at StateSaveLocation, and can read and write it. (Their
operating systems differ, though: slurm1 runs CentOS 7 and slurm2
runs CentOS 8.)<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">When slurm1 controls the
cluster and slurm2 runs in standby mode, the cluster has no
problems.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">But when we run
“scontrol takeover” on slurm2 to switch the primary role, we
find that all newly submitted jobs get stuck in the PD state.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">slurm2 never allocates
resources to any job, no matter how long we wait. Meanwhile,
jobs that were already running complete without problems, and query
commands such as “sinfo” and “sacct” all work well.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">The pending reason is
first shown as “priority” in squeue, but after we manually
update the priority, the reason becomes “none” and the job stays
stuck in the PD state.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">While slurm2 is the
primary, there are no significant errors in slurmctld.log. Only
after we restart the slurm1 service, which returns slurm2 to the
standby role, does slurm2 report many errors such as:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">error: Invalid RPC
received MESSAGE_NODE_REGISTRATION_STATUS while in standby
mode<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">error: Invalid RPC
received REQUEST_COMPLETE_PROLOG while in standby mode<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">error: Invalid RPC
received REQUEST_COMPLETE_JOB_ALLOCATION while in standby
mode<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">So, are there any
suggestions for finding out why slurm2 behaves abnormally as the
primary controller?<o:p></o:p></span></p>
</div>
</blockquote>
</body>
</html>