<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=utf-8"><meta name=Generator content="Microsoft Word 15 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
p.gmail-m-7619039430370511186msolistparagraph, li.gmail-m-7619039430370511186msolistparagraph, div.gmail-m-7619039430370511186msolistparagraph
{mso-style-name:gmail-m_-7619039430370511186msolistparagraph;
mso-margin-top-alt:auto;
margin-right:0in;
mso-margin-bottom-alt:auto;
margin-left:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
@list l0
{mso-list-id:945697166;
mso-list-template-ids:-1055364414;}
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1><p class=MsoNormal>The uid and gid are the same for the slurm and munge users on each node. The two new nodes, one of which can’t connect with the controller, have the same users and were created with the same sequence of steps. The only difference is that the node that won’t connect has the software stack to compile slurm installed on it. I’ll try removing these packages and see if that makes any difference.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>I was wrong about the nodes not having ntp. They are all running systemd-timesyncd.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>I’ve found something interesting and inconsistent on the nodes that I’ll post in a new thread since this one is going nowhere.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><div><div style='border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in'><p class=MsoNormal><b>From:</b> slurm-users &lt;slurm-users-bounces@lists.schedmd.com&gt; <b>On Behalf Of </b>Brian Andrus<br><b>Sent:</b> Sunday, April 19, 2020 9:30 AM<br><b>To:</b> slurm-users@lists.schedmd.com<br><b>Subject:</b> Re: [slurm-users] Munge decode failing on new node<o:p></o:p></p></div></div><p class=MsoNormal><o:p> </o:p></p><p>I see two things you should probably do:<o:p></o:p></p><ol start=1 type=1><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'>Run ntpd on your nodes. You can even have them sync with your master. <o:p></o:p></li><li class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1'>Sync your user data on the nodes too, even if that just means ensuring /etc/passwd and /etc/group are the same on all of them.<o:p></o:p></li></ol><p>While ntp is not required for slurm, the time sync is very important and ntp makes that a non-issue. 
Best practices and all.<o:p></o:p></p><p>#2 is something that is often overlooked, but obvious when you think about it.<br>I have seen folks add users by doing 'useradd' on each node, but that messes everything up if you installed a package or such that changed the next uid on any node.<o:p></o:p></p><p>The error below looks like you may have a different uid for the slurm user on the node. What uid is slurmd running as on the bad node vs a good node?<o:p></o:p></p><p>Brian Andrus<o:p></o:p></p><p><o:p> </o:p></p><div><p class=MsoNormal>On 4/17/2020 2:38 PM, Dean Schulze wrote:<o:p></o:p></p></div><blockquote style='margin-top:5.0pt;margin-bottom:5.0pt'><div><p class=MsoNormal>Just noticed this. On the problem node the munged.log file has an entry every 1:40: <o:p></o:p></p><div><p class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>2020-04-17 15:31:02 -0600 Info: Invalid credential<br>2020-04-17 15:32:42 -0600 Info: Invalid credential<br>2020-04-17 15:34:22 -0600 Info: Invalid credential<o:p></o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>This happens on the failed node and two other nodes that work. 
Two nodes that work (including the controller) don't have this message.<o:p></o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p></div></div><p class=MsoNormal><o:p> </o:p></p><div><div><p class=MsoNormal>On Fri, Apr 17, 2020 at 2:00 PM Riebs, Andy &lt;<a href="mailto:andy.riebs@hpe.com">andy.riebs@hpe.com</a>&gt; wrote:<o:p></o:p></p></div><blockquote style='border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in'><div><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB style='color:#1F497D'>A couple of quick checks to see if the problem is munge:</span><span lang=EN-GB><o:p></o:p></span></p><p class=gmail-m-7619039430370511186msolistparagraph><span lang=EN-GB style='color:#1F497D'>1.</span><span lang=EN-GB style='font-size:7.0pt;font-family:"Times New Roman",serif;color:#1F497D'> </span><span lang=EN-GB style='color:#1F497D'>On the problem node, try<br>$ echo foo | munge | unmunge</span><span lang=EN-GB><o:p></o:p></span></p><p class=gmail-m-7619039430370511186msolistparagraph><span lang=EN-GB style='color:#1F497D'>2.</span><span lang=EN-GB style='font-size:7.0pt;font-family:"Times New Roman",serif;color:#1F497D'> </span><span lang=EN-GB style='color:#1F497D'>If (1) works, try this from the node running slurmctld to the problem node<br>slurm-node$ echo foo | ssh node munge | unmunge</span><span lang=EN-GB><o:p></o:p></span></p><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB style='color:#1F497D'> </span><span lang=EN-GB><o:p></o:p></span></p><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><b>From:</b> slurm-users [mailto:<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>] <b>On Behalf Of </b>Dean Schulze<br><b>Sent:</b> Friday, April 17, 2020 3:40 PM<br><b>To:</b> Slurm User Community List 
<<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>><br><b>Subject:</b> Re: [slurm-users] Munge decode failing on new node<span lang=EN-GB><o:p></o:p></span></p><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB> <o:p></o:p></span></p><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB>There is no ntp service running on any of my nodes, and all but this one is working. I haven't heard that ntp is a requirement for slurm, just that the time be synchronized across the cluster. And it is.<o:p></o:p></span></p></div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB> <o:p></o:p></span></p><div><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB>On Wed, Apr 15, 2020 at 12:17 PM Carlos Fenoy <<a href="mailto:minibit@gmail.com" target="_blank">minibit@gmail.com</a>> wrote:<o:p></o:p></span></p></div><blockquote style='border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt'><div><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB>I’d check ntp as your encoding time seems odd to me<o:p></o:p></span></p></div></div><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB> <o:p></o:p></span></p><div><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB>On Wed, 15 Apr 2020 at 19:59, Dean Schulze <<a href="mailto:dean.w.schulze@gmail.com" target="_blank">dean.w.schulze@gmail.com</a>> wrote:<o:p></o:p></span></p></div><blockquote style='border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt'><div><p class=MsoNormal 
style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB>I've installed two new nodes onto my slurm cluster. One node works, but the other one complains about an invalid credential for munge. I've verified that the munge.key is the same as on all other nodes with<o:p></o:p></span></p><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB><br>sudo cksum /etc/munge/munge.key<o:p></o:p></span></p></div><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB> <o:p></o:p></span></p></div><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB>I recopied a munge.key from a node that works. I've verified that munge uid and gid are the same on the nodes. The time is in sync on all nodes. <o:p></o:p></span></p></div><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB> <o:p></o:p></span></p></div><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB>Here is what is in the slurmd.log:<o:p></o:p></span></p><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB> <o:p></o:p></span></p></div><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB> error: Unable to register: Unable to contact slurm controller (connect failure)<br> error: Munge decode failed: Invalid credential<br> ENCODED: Wed Dec 31 17:00:00 1969<br> DECODED: Wed Dec 31 17:00:00 1969<br> error: authentication: Invalid authentication credential<br> error: slurm_receive_msg_and_forward: Protocol authentication error<br> error: service_connection: slurm_receive_msg: Protocol authentication error<br> error: Unable to register: Unable to contact slurm controller (connect failure)<o:p></o:p></span></p></div></div><div><p class=MsoNormal 
style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB> <o:p></o:p></span></p></div><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB>I've checked in the munged.log and all it says is <o:p></o:p></span></p></div><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB> <o:p></o:p></span></p></div><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB>Invalid credential <o:p></o:p></span></p></div><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB> <o:p></o:p></span></p></div><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB>Thanks for your help<o:p></o:p></span></p></div></div></blockquote></div></div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB>-- <o:p></o:p></span></p><div><p class=MsoNormal style='mso-margin-top-alt:auto;mso-margin-bottom-alt:auto'><span lang=EN-GB>--<br>Carles Fenoy<o:p></o:p></span></p></div></blockquote></div></div></div></blockquote></div></blockquote></div></body></html>