<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>I see two things you should probably do:</p>
    <ol>
      <li>Run ntpd on your nodes. You can even have them sync with your
        master.
      </li>
      <li>Sync your user data across the nodes too, even if that just means
        ensuring /etc/passwd and /etc/group are identical on all of them.</li>
    </ol>
    <p>While ntp is not required for slurm, time sync is very
      important, and ntp makes that a non-issue. Best practices and all.</p>
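    <p>A quick way to sanity-check the clocks is to compare epoch
      timestamps between the controller and a node. The sketch below is
      only an illustration: <code>clock_skew</code> is a hypothetical helper
      and <code>node01</code> is a placeholder host name. As far as I know,
      MUNGE credentials default to a 300 second lifetime, so a skew anywhere
      near that is worth fixing.</p>
    <pre>
```shell
#!/bin/sh
# Sketch only: clock_skew is a hypothetical helper, not part of Slurm or MUNGE.
# Prints the absolute difference, in seconds, between two epoch timestamps.
clock_skew() {
    if [ "$1" -ge "$2" ]; then
        echo $(( $1 - $2 ))
    else
        echo $(( $2 - $1 ))
    fi
}

# Example use from the controller ("node01" is a placeholder host name):
#   local_now=$(date +%s)
#   remote_now=$(ssh node01 date +%s)
#   echo "skew: $(clock_skew "$local_now" "$remote_now")s"
```
    </pre>
    <p>The ssh round trip adds a little latency to the measurement, so
      treat a second or two of reported skew as noise.</p>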
    <p>#2 is often overlooked, but obvious when you think about it.<br>
      I have seen folks add users by running 'useradd' on each node, but
      that breaks things if a package install (or anything else) has
      changed the next available uid on any node: the same user then ends
      up with a different uid there.</p>
    <p>The error below looks like you may have a different uid for the
      slurm user on the node. What uid is slurmd running as on the bad
      node vs a good node?</p>
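    <p>One way to answer that question is to pull the uid for the slurm
      and munge users on each node and compare them. This is a minimal
      sketch, not a definitive procedure: <code>uid_of</code> is a
      hypothetical helper, the node names are placeholders, and the
      <code>ps -C</code> form assumes a Linux (procps) <code>ps</code>.</p>
    <pre>
```shell
#!/bin/sh
# Sketch only: uid_of is a hypothetical helper, not part of Slurm.
# Prints the numeric uid (third field) for a user in passwd-format input.
#   $1 = user name, $2 = passwd-format file (for example /etc/passwd)
uid_of() {
    awk -F: -v u="$1" '$1 == u { print $3 }' "$2"
}

# Example use ("node01" and "node02" are placeholder host names):
#   uid_of slurm /etc/passwd
#   for n in node01 node02; do
#       echo "$n: $(ssh "$n" id -u slurm) $(ssh "$n" id -u munge)"
#   done
#
# To see which uid slurmd is actually running as on a node (Linux ps):
#   ps -o uid=,user= -C slurmd
```
    </pre>
    <p>If the numbers differ between the bad node and a good one, fix the
      account on the bad node and the ownership of its files before
      restarting the daemons.</p>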
    <p>Brian Andrus<br>
    </p>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 4/17/2020 2:38 PM, Dean Schulze
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CA+LiX6E+d0nS1TYQjeP0cE8REDOFSwvvztGnN2jU_sQsLRdhZA@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">Just noticed this.  On the problem node the
        munged.log file has an entry every 1:40:
        <div><br>
        </div>
        <div>2020-04-17 15:31:02 -0600 Info:      Invalid credential<br>
          2020-04-17 15:32:42 -0600 Info:      Invalid credential<br>
          2020-04-17 15:34:22 -0600 Info:      Invalid credential<br>
        </div>
        <div><br>
        </div>
        <div>This happens on the failed node and two other nodes that
          work.  Two nodes that work (including the controller) don't
          have this message.</div>
        <div><br>
        </div>
        <div><br>
        </div>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Fri, Apr 17, 2020 at 2:00
          PM Riebs, Andy &lt;<a href="mailto:andy.riebs@hpe.com"
            moz-do-not-send="true">andy.riebs@hpe.com</a>&gt; wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div lang="EN-GB">
            <div class="gmail-m_-7619039430370511186WordSection1">
              <p class="MsoNormal"><span
style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">A
                  couple of quick checks to see if the problem is munge:</span></p>
              <p class="gmail-m_-7619039430370511186MsoListParagraph"><span
style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)"><span>1.<span
                      style="font:7pt &quot;Times New Roman&quot;">      
                    </span></span></span><span
style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">On
                  the problem node, try<br>
                  $ echo foo | munge | unmunge</span></p>
              <p class="gmail-m_-7619039430370511186MsoListParagraph"><span
style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)"><span>2.<span
                      style="font:7pt &quot;Times New Roman&quot;">      
                    </span></span></span><span
style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)">If
                  (1) works, try this from the node running slurmctld to
                  the problem node<br>
                  slurm-node$ echo foo | ssh node munge | unmunge</span></p>
              <p class="MsoNormal"><span
style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)"> </span></p>
              <p class="MsoNormal"><b><span
                    style="font-size:11pt;font-family:Calibri,sans-serif"
                    lang="EN-US">From:</span></b><span
                  style="font-size:11pt;font-family:Calibri,sans-serif"
                  lang="EN-US"> slurm-users [mailto:<a
                    href="mailto:slurm-users-bounces@lists.schedmd.com"
                    target="_blank" moz-do-not-send="true">slurm-users-bounces@lists.schedmd.com</a>]
                  <b>On Behalf Of </b>Dean Schulze<br>
                  <b>Sent:</b> Friday, April 17, 2020 3:40 PM<br>
                  <b>To:</b> Slurm User Community List &lt;<a
                    href="mailto:slurm-users@lists.schedmd.com"
                    target="_blank" moz-do-not-send="true">slurm-users@lists.schedmd.com</a>&gt;<br>
                  <b>Subject:</b> Re: [slurm-users] Munge decode failing
                  on new node</span></p>
              <p class="MsoNormal"> </p>
              <div>
                <p class="MsoNormal">There is no ntp service running on
                  any of my nodes, and all but this one is working.  I
                  haven't heard that ntp is a requirement for slurm,
                  just that the time be synchronized across the
                  cluster.  And it is.</p>
              </div>
              <p class="MsoNormal"> </p>
              <div>
                <div>
                  <p class="MsoNormal">On Wed, Apr 15, 2020 at 12:17 PM
                    Carlos Fenoy &lt;<a href="mailto:minibit@gmail.com"
                      target="_blank" moz-do-not-send="true">minibit@gmail.com</a>&gt;
                    wrote:</p>
                </div>
                <blockquote
style="border-top:none;border-right:none;border-bottom:none;border-left:1pt
                  solid rgb(204,204,204);padding:0in 0in 0in
                  6pt;margin-left:4.8pt;margin-right:0in">
                  <div>
                    <div>
                      <p class="MsoNormal">I’d check ntp as your
                        encoding time seems odd to me</p>
                    </div>
                  </div>
                  <div>
                    <p class="MsoNormal"> </p>
                    <div>
                      <div>
                        <p class="MsoNormal">On Wed, 15 Apr 2020 at
                          19:59, Dean Schulze &lt;<a
                            href="mailto:dean.w.schulze@gmail.com"
                            target="_blank" moz-do-not-send="true">dean.w.schulze@gmail.com</a>&gt;
                          wrote:</p>
                      </div>
                      <blockquote
style="border-top:none;border-right:none;border-bottom:none;border-left:1pt
                        solid rgb(204,204,204);padding:0in 0in 0in
                        6pt;margin-left:4.8pt;margin-right:0in">
                        <div>
                          <p class="MsoNormal">I've installed two new
                            nodes onto my slurm cluster.  One node
                            works, but the other one complains about an
                            invalid credential for munge.  I've verified
                            that the munge.key is the same as on all
                            other nodes with</p>
                          <div>
                            <p class="MsoNormal"><br>
                              sudo cksum /etc/munge/munge.key</p>
                          </div>
                          <div>
                            <p class="MsoNormal"> </p>
                          </div>
                          <div>
                            <p class="MsoNormal">I recopied a munge.key
                              from a node that works.  I've verified
                              that munge uid and gid are the same on the
                              nodes.  The time is in sync on all nodes. </p>
                          </div>
                          <div>
                            <p class="MsoNormal"> </p>
                          </div>
                          <div>
                            <p class="MsoNormal">Here is what is in the
                              slurmd.log:</p>
                            <div>
                              <p class="MsoNormal"> </p>
                            </div>
                            <div>
                              <p class="MsoNormal"> error: Unable to
                                register: Unable to contact slurm
                                controller (connect failure)<br>
                                 error: Munge decode failed: Invalid
                                credential<br>
                                 ENCODED: Wed Dec 31 17:00:00 1969<br>
                                 DECODED: Wed Dec 31 17:00:00 1969<br>
                                 error: authentication: Invalid
                                authentication credential<br>
                                 error: slurm_receive_msg_and_forward:
                                Protocol authentication error<br>
                                 error: service_connection:
                                slurm_receive_msg: Protocol
                                authentication error<br>
                                 error: Unable to register: Unable to
                                contact slurm controller (connect
                                failure)</p>
                            </div>
                          </div>
                          <div>
                            <p class="MsoNormal"> </p>
                          </div>
                          <div>
                            <p class="MsoNormal">I've checked in the
                              munged.log and all it says is </p>
                          </div>
                          <div>
                            <p class="MsoNormal"> </p>
                          </div>
                          <div>
                            <p class="MsoNormal">Invalid credential </p>
                          </div>
                          <div>
                            <p class="MsoNormal"> </p>
                          </div>
                          <div>
                            <p class="MsoNormal">Thanks for your help</p>
                          </div>
                        </div>
                      </blockquote>
                    </div>
                  </div>
                  <p class="MsoNormal">-- </p>
                  <div>
                    <p class="MsoNormal">--<br>
                      Carles Fenoy</p>
                  </div>
                </blockquote>
              </div>
            </div>
          </div>
        </blockquote>
      </div>
    </blockquote>
  </body>
</html>