<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Having the head node run as an NTP server is a good idea. I set
      up my clusters the same way. Is it possible that ntp.conf on the
      head node has a restrict statement that limits access by IP
      address or range, and that this is why the one node on a different
      network can't reach it? <br>
</p>
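<p>A rough way to check this, as a sketch only: the ntp.conf excerpt
      and all addresses below are made up, and the matching rule (the
      most specific matching restrict entry decides) is a simplification
      of what ntpd does. It handles IPv4 entries only.</p>
<pre>
```python
import ipaddress

# Hypothetical ntp.conf excerpt; these addresses are made up for illustration.
NTP_CONF = """
restrict default ignore
restrict 127.0.0.1
restrict 10.1.0.0 mask 255.255.0.0 nomodify notrap
"""

def parse_restrict(conf_text):
    """Return (network, flags) pairs for the IPv4 'restrict' lines."""
    rules = []
    for line in conf_text.splitlines():
        words = [w for w in line.split("#")[0].split() if w != "-4"]
        if words[:1] != ["restrict"] or "-6" in words or len(words) == 1:
            continue
        addr, rest = words[1], words[2:]
        if addr == "default":
            cidr, flags = "0.0.0.0/0", rest
        elif rest[:1] == ["mask"] and len(rest) != 1:
            cidr, flags = addr + "/" + rest[1], rest[2:]
        else:
            cidr, flags = addr + "/32", rest
        try:
            rules.append((ipaddress.ip_network(cidr), flags))
        except ValueError:
            continue  # hostnames, 'source', IPv6 literals: skipped in this sketch
    return rules

def flags_for(client_ip, rules):
    """Flags of the most specific matching rule, or None if no rule matches."""
    ip = ipaddress.ip_address(client_ip)
    matches = [(net.prefixlen, flags) for net, flags in rules if ip in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0])[1]

# 10.2.0.5 stands in for the node on the other subnet.
print(flags_for("10.1.2.3", parse_restrict(NTP_CONF)))
print(flags_for("10.2.0.5", parse_restrict(NTP_CONF)))
```
</pre>
<p>With this made-up config, the node on the other subnet matches
      only the 'restrict default ignore' line, so ntpd on the head node
      would drop its packets. That would be consistent with ntpdate
      finding no suitable server.</p>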
<p>It sounds like it's working now, but I don't understand why
ntpdate would give you that error unless it couldn't reach ntpd on
the head node. <br>
</p>
<p>Prentice<br>
</p>
<p><br>
</p>
<div class="moz-cite-prefix">On 10/27/20 4:58 PM, Gard Nelson wrote:<br>
</div>
<blockquote type="cite"
cite="mid:6786CE5A-329D-48C7-B7E2-0E989D441571@nantbio.com">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt">Thanks for
your help, Prentice.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
      <p class="MsoNormal"><span style="font-size:11.0pt">Sorry, yes:
          CentOS 7.5 installed on a fresh HDD. I rebooted and checked
          that chronyd is disabled and ntpd is running. The rest of the
          cluster uses CentOS 7.5 and ntp, so it’s possible, although
          maybe not ideal.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">I’m running
ntpq on the new compute node. It is looking to the slurm
head node which is also set up as the ntp server. Here’s the
output:<o:p></o:p></span></p>
<p class="MsoNormal"><span
style="font-size:11.0pt;font-family:Courier"><o:p> </o:p></span></p>
      <pre>[root ~]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 HEADNODE_IP     .XFAC.          16 u    - 1024    0    0.000    0.000   0.000</pre>
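      <p>A note on reading the ntpq line above: the reach column is an
        octal bitmask of the last eight polls, so 0 means none of them
        were answered, and stratum 16 means the peer is treated as
        unsynchronized. The all-zero delay, offset and jitter therefore
        more likely mean that no reply had arrived from the head node
        yet than that the clocks agree perfectly. The helper below is
        only a sketch of that reading; its name and the second sample
        line in the note are made up.</p>
      <pre>
```python
def peer_status(line):
    """Summarise one peer line from 'ntpq -p' (fields split on whitespace)."""
    fields = line.split()
    reach = int(fields[6], 8)  # ntpq prints reach in octal
    return {
        # A leading '*' marks the peer ntpd is currently synchronised to.
        "is_sys_peer": line.startswith("*"),
        "stratum": int(fields[2]),
        "replies_in_last_8_polls": bin(reach).count("1"),
    }

print(peer_status("HEADNODE_IP .XFAC. 16 u - 1024 0 0.000 0.000 0.000"))
```
      </pre>
      <p>For comparison, a healthy peer line typically starts with '*'
        and shows a reach of 377, which is eight answered polls out of
        eight.</p>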
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
      <p class="MsoNormal"><span style="font-size:11.0pt">It was a bit
          of a pain to get set up. The time difference was several
          hours, so ntp would have taken ages to fix it on its own. I have
          used ntpdate successfully on the existing compute nodes, but
          got a “no server suitable for synchronization found” error
          here, and ‘ntpd -gqx’ timed out. So to set the time, I
          had to point ntp at the default CentOS pool of ntp servers
          and then point it back to the head node.
          After that, ‘ntpd -gqx’ ran smoothly and I assume (based on
          the ntpq output) that it worked. Running ‘date’ on the new
          compute node and the existing head node simultaneously returns
          the same time to within ~1 sec, rather than the gap of about
          7.5 minutes shown in the log file.
          <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
      <p class="MsoNormal"><span style="font-size:11.0pt">Not sure if
          it’s relevant to this problem, but the new compute node is
          on a different subnet, connected to a different port than the
          existing compute nodes. This is the first time that I’ve set
          up a node on a different subnet. I figured it would be simple
          to point slurm to the new node, but I didn’t anticipate ntp
          and munge issues.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Thanks,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Gard<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<div style="border:none;border-top:solid #B5C4DF
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span style="color:black">From: </span></b><span
            style="color:black">slurm-users
            <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users-bounces@lists.schedmd.com">&lt;slurm-users-bounces@lists.schedmd.com&gt;</a> on behalf of
            Prentice Bisbal <a class="moz-txt-link-rfc2396E" href="mailto:pbisbal@pppl.gov">&lt;pbisbal@pppl.gov&gt;</a><br>
            <b>Reply-To: </b>Slurm User Community List
            <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com">&lt;slurm-users@lists.schedmd.com&gt;</a><br>
            <b>Date: </b>Tuesday, October 27, 2020 at 12:22 PM<br>
            <b>To: </b><a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com">"slurm-users@lists.schedmd.com"</a>
            <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com">&lt;slurm-users@lists.schedmd.com&gt;</a><br>
<b>Subject: </b>Re: [slurm-users] [External] Munge thinks
clocks aren't synced<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
</div>
<p>You don't specify what OS or version you're using. If you're
using RHEL 7 or a derivative, chrony is used by default over
ntpd, so there could be some confusion between chronyd and
ntpd. If you haven't done so already, I'd check to see which
daemon is actually running on your system. <o:p></o:p></p>
<p>Can you share the complete output of ntpq -p with us, and let
us know what nodes the output is from? You might want to run
'ntpdate' before starting ntpd. If the clocks are too far off,
either ntpd won't correct the time, or it will take a long
time. ntpdate immediately syncs up the time between servers. <o:p></o:p></p>
      <p>I would make sure ntpdate is installed and enabled, then
        reboot both compute nodes. That way ntpdate is called at
        startup before ntpd, so all of the nodes start out with the
        correct time.
        <o:p></o:p></p>
<p>--<br>
Prentice<o:p></o:p></p>
<p><o:p> </o:p></p>
<div>
<p class="MsoNormal">On 10/27/20 2:08 PM, Gard Nelson wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<p class="MsoNormal"><span style="font-size:11.0pt">Hi
everyone,</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
        <p class="MsoNormal"><span style="font-size:11.0pt">I’m adding
            a new node to an existing cluster. After installing slurm
            and the prereqs, I synced the clocks with ntpd. When I run
            ‘ntpq -p’, I get 0.0 for delay, offset and jitter (the
            slurm head node is also the ntp server). ‘date’ also gives
            me identical times for the head and compute nodes.
            However, when I start slurmd, I get a munge error about
            the clocks being out of sync. From the slurmctld log:</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[2020-10-27T11:02:06.511]
node NEW_NODE returned to service</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[2020-10-27T11:02:07.265]
error: Munge decode failed: Rewound credential</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[2020-10-27T11:02:07.265]
ENCODED: Tue Oct 27 11:09:45 2020</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[2020-10-27T11:02:07.265]
DECODED: Tue Oct 27 11:02:07 2020</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[2020-10-27T11:02:07.265]
error: Check for out of sync clocks</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[2020-10-27T11:02:07.265]
error: slurm_unpack_received_msg:
MESSAGE_NODE_REGISTRATION_STATUS has authentication error:
Rewound credential</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[2020-10-27T11:02:07.265]
error: slurm_unpack_received_msg: Protocol authentication
error</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt">[2020-10-27T11:02:07.275]
error: slurm_receive_msg [HEAD_NODE_IP:PORT]: Unspecified
error</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt">I
restarted ntp, munge and the slurm daemons on both nodes
before this last error was generated. Any idea what’s
going on here?</span><o:p></o:p></p>
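        <p>The size of the clock gap can be read straight off the
          ENCODED and DECODED lines in the log above. The snippet below
          is only a sketch of that arithmetic; the function name is made
          up, and it assumes the timestamps are in the English
          day/month format shown in the log.</p>
        <pre>
```python
from datetime import datetime

# Timestamp layout used in the ENCODED/DECODED log lines.
FMT = "%a %b %d %H:%M:%S %Y"

def skew_seconds(encoded, decoded):
    """Seconds by which the encoding host's clock is ahead of the decoder's."""
    delta = datetime.strptime(encoded, FMT) - datetime.strptime(decoded, FMT)
    return delta.total_seconds()

skew = skew_seconds("Tue Oct 27 11:09:45 2020", "Tue Oct 27 11:02:07 2020")
print(f"encoder clock ahead by {skew:.0f} s")  # prints: encoder clock ahead by 458 s
```
        </pre>
        <p>That is 7 minutes 38 seconds. Since the credential was
          encoded later than the time at which it was decoded, the host
          that created it (presumably the new node, as this is its
          registration message) appears to have had its clock running
          ahead of the head node.</p>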
<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Thanks,</span><o:p></o:p></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Gard</span><o:p></o:p></p>
</blockquote>
<pre>-- <o:p></o:p></pre>
<pre>Prentice Bisbal<o:p></o:p></pre>
<pre>Lead Software Engineer<o:p></o:p></pre>
<pre>Research Computing<o:p></o:p></pre>
<pre>Princeton Plasma Physics Laboratory<o:p></o:p></pre>
<pre><a href="https://urldefense.com/v3/__http:/www.pppl.gov__;!!LM3lv1w8qtQ!AUViCRtpIXKV37Z4WGp5j64ppClYVIuzUEXXvfoDHHD_tVjDVMA9b2gBHtaWUHsEPdvmkQ$" moz-do-not-send="true">http://www.pppl.gov</a><o:p></o:p></pre>
</div>
</blockquote>
</body>
</html>