<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=utf-8"><meta name=Generator content="Microsoft Word 15 (filtered medium)"><style><!--

/* Font Definitions */

@font-face

        {font-family:宋体;

        panose-1:2 1 6 0 3 1 1 1 1 1;}

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:等线;

        panose-1:2 1 6 0 3 1 1 1 1 1;}

@font-face

        {font-family:"\@宋体";

        panose-1:2 1 6 0 3 1 1 1 1 1;}

@font-face

        {font-family:"\@等线";

        panose-1:2 1 6 0 3 1 1 1 1 1;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0cm;

        margin-bottom:.0001pt;

        text-align:justify;

        text-justify:inter-ideograph;

        font-size:10.5pt;

        font-family:等线;}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:#0563C1;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:#954F72;

        text-decoration:underline;}

p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph

        {mso-style-priority:34;

        margin:0cm;

        margin-bottom:.0001pt;

        text-align:justify;

        text-justify:inter-ideograph;

        text-indent:21.0pt;

        font-size:10.5pt;

        font-family:等线;}

p.msonormal0, li.msonormal0, div.msonormal0

        {mso-style-name:msonormal;

        mso-margin-top-alt:auto;

        margin-right:0cm;

        mso-margin-bottom-alt:auto;

        margin-left:0cm;

        text-align:left;

        font-size:12.0pt;

        font-family:宋体;}

span.EmailStyle20

        {mso-style-type:personal;

        font-family:等线;

        color:windowtext;}

span.EmailStyle21

        {mso-style-type:personal-reply;

        font-family:等线;

        color:windowtext;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-size:10.0pt;}

@page WordSection1

        {size:612.0pt 792.0pt;

        margin:72.0pt 90.0pt 72.0pt 90.0pt;}

div.WordSection1

        {page:WordSection1;}

--></style><!--[if gte mso 9]><xml>

<o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

<o:shapelayout v:ext="edit">

<o:idmap v:ext="edit" data="1" />

</o:shapelayout></xml><![endif]--></head><body lang=ZH-CN link="#0563C1" vlink="#954F72"><div class=WordSection1><p class=MsoNormal><span lang=EN-US>Well, after increasing the slurmctld log level to debug, we did find some munge-related errors, such as:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>[2022-06-04T15:17:21.258] debug:  auth/munge: _decode_cred: Munge decode failed: Failed to connect to "/run/munge/munge.socket.2": Resource temporarily unavailable (retrying ...)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>But when we test munge manually, it works well between slurm2 and the compute nodes.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>> munge -n | ssh node010 unmunge<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>The authenticity of host 'node010 (192.168.1.10)' can't be established.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>RSA key fingerprint is SHA256:/fx4zQPDDPHj7df6ml0Fd0kn8cIKkSO0OgKpF+qcRDI.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Are you sure you want to continue connecting (yes/no/[fingerprint])? 
yes<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Warning: Permanently added 'node010,192.168.1.10' (RSA) to the list of known hosts.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Password:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>STATUS:          Success (0)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>ENCODE_HOST:     slurm2 (192.168.0.33)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>ENCODE_TIME:     2022-06-04 16:11:35 +0800 (1654330295)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>DECODE_TIME:     2022-06-04 16:11:52 +0800 (1654330312)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>TTL:             300<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>CIPHER:          aes128 (4)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>MAC:             sha256 (5)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>ZIP:             none (0)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>UID:             root (0)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>GID:             root (0)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>LENGTH:          0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Of course, running munge on the compute nodes and unmunge on slurm2 also works well.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>So what else does slurmctld require from munge? 
Or what is the difference between Slurm's auth/munge and a manual munge/unmunge test?<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><div><div style='border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm'><p class=MsoNormal align=left style='text-align:left'><b><span style='font-size:11.0pt'>From<span lang=EN-US>:</span></span></b><span lang=EN-US style='font-size:11.0pt'> Brian Andrus <> <br></span><b><span style='font-size:11.0pt'>Sent<span lang=EN-US>:</span></span></b><span lang=EN-US style='font-size:11.0pt'> June 3, 2022</span><span style='font-size:11.0pt'><span lang=EN-US> 21:16<br></span><b>To<span lang=EN-US>:</span></b><span lang=EN-US> slurm-users@lists.schedmd.com<br></span><b>Subject<span lang=EN-US>:</span></b><span lang=EN-US> Re: [slurm-users] what is the possible reason for secondary slurmctld node not allocate job after takeover?<o:p></o:p></span></span></p></div></div><p class=MsoNormal align=left style='text-align:left'><span lang=EN-US><o:p> </o:p></span></p><p><span lang=EN-US>Offhand, I would suggest double-checking munge and the versions of slurmd/slurmctld.</span><span lang=EN-US style='font-size:12.0pt'><o:p></o:p></span></p><p><span lang=EN-US>Brian Andrus<o:p></o:p></span></p><div><p class=MsoNormal><span lang=EN-US>On 6/3/2022 3:17 AM, <a href="mailto:taleintervenor@sjtu.edu.cn">taleintervenor@sjtu.edu.cn</a> wrote:<o:p></o:p></span></p></div><blockquote style='margin-top:5.0pt;margin-bottom:5.0pt'><p class=MsoNormal><span lang=EN-US>Hi, all:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Our cluster has 2 Slurm control nodes, and scontrol show config reports the following:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>> scontrol show config<o:p></o:p></span></p><p class=MsoNormal>…<span lang=EN-US><o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>SlurmctldHost[0]        = 
slurm1<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>SlurmctldHost[1]        = slurm2<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>StateSaveLocation       = /etc/slurm/state<o:p></o:p></span></p><p class=MsoNormal>…<span lang=EN-US><o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Of course, we have made sure that both nodes have the same Slurm configuration, mount the same NFS share at StateSaveLocation, and can read and write it. (Their operating systems differ, though: slurm1 runs CentOS 7 and slurm2 runs CentOS 8.)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>When slurm1 controls the cluster and slurm2 works in standby mode, the cluster has no problems.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>But when we run </span>“<span lang=EN-US>scontrol takeover</span>”<span lang=EN-US> on slurm2 to switch the primary role, all newly submitted jobs get stuck in the PD state.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>slurm2 never allocates resources to any job, no matter how long we wait. Meanwhile, jobs that were already running complete without problems, and query commands like </span>“<span lang=EN-US>sinfo</span>”<span lang=EN-US> and </span>“<span lang=EN-US>sacct</span>”<span lang=EN-US> all work well.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>The pending reason is first shown as </span>“<span lang=EN-US>priority</span>”<span lang=EN-US> in squeue, but after we manually update the priority, the reason becomes </span>“<span lang=EN-US>none</span>”<span lang=EN-US> and the job is still stuck in the PD state.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>While slurm2 is primary, there are no significant errors in slurmctld.log. 
Only after we restart the slurm1 service so that slurm2 returns to the standby role does it report many errors such as:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>error: Invalid RPC received MESSAGE_NODE_REGISTRATION_STATUS while in standby mode<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>error: Invalid RPC received REQUEST_COMPLETE_PROLOG while in standby mode<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>error: Invalid RPC received REQUEST_COMPLETE_JOB_ALLOCATION while in standby mode<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>So are there any suggestions for finding out why slurm2 behaves abnormally as the primary controller?<o:p></o:p></span></p></blockquote></div></body></html>