[slurm-users] [External] Munge thinks clocks aren't synced
Prentice Bisbal
pbisbal at pppl.gov
Thu Oct 29 18:58:19 UTC 2020
Good catch. I didn't even notice that. I definitely think the ntp.conf
file on the head node is restricting access by IP range.
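
For example (just a sketch; the 10.1.2.0/24 range is a placeholder for
whatever subnet the new node is actually on), /etc/ntp.conf on the head
node would need a restrict line that covers the new range, something like:

   # allow hosts on the new subnet to query and sync, but not modify the server
   restrict 10.1.2.0 mask 255.255.255.0 nomodify notrap

followed by a restart of ntpd on the head node.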
Prentice
On 10/28/20 3:04 AM, Williams, Gareth (IM&T, Black Mountain) wrote:
>
> I'm pretty sure that ntpq output indicates NTP is not working: reach=0
> means none of the recent poll attempts succeeded (reach is an octal shift
> register of the last eight polls, so a healthy peer shows 377).
>
> https://www.linuxjournal.com/article/6812
>
> Gareth
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf Of* Barbara Krašovec
> *Sent:* Wednesday, 28 October 2020 5:41 PM
> *To:* slurm-users at lists.schedmd.com
> *Subject:* Re: [slurm-users] [External] Munge thinks clocks aren't synced
>
> A "Rewound credential" error means the credential appears to have been
> encoded more than TTL seconds in the future (the default munge TTL is 5
> minutes), i.e. the clock on the decoding host is behind the clock on the
> encoding host. You can try running munge with a different TTL (munge -t)
> just to verify whether it is a time-sync issue. Also check the timestamp
> on the munge.key.
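>
> For example, a quick end-to-end check might look like this (HEADNODE is a
> placeholder, and both hosts must share the same munge.key):
>
>    munge -n | unmunge                       # local sanity check
>    munge -n -t 900 | ssh HEADNODE unmunge   # 15-minute TTL, decoded remotely
>
> If the second command succeeds while the default TTL fails, out-of-sync
> clocks are the likely culprit.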
>
> I don't think it's related to the new subnet.
>
> Cheers,
>
> Barbara
>
> On 10/27/20 9:58 PM, Gard Nelson wrote:
>
> Thanks for your help, Prentice.
>
> Sorry, yes: CentOS 7.5 installed on a fresh HDD. I rebooted and
> checked that chronyd is disabled and ntpd is running. The rest of the
> cluster uses CentOS 7.5 and ntp, so staying on ntp is possible here,
> although maybe not ideal.
>
> I'm running ntpq on the new compute node. It is pointed at the slurm
> head node, which is also set up as the ntp server. Here's the output:
>
> [root ~]# ntpq -p
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
>  HEADNODE_IP     .XFAC.          16 u    -  1024    0    0.000    0.000   0.000
>
> It was a bit of a pain to get set up. The time difference was
> several hours, so ntp would have taken ages to fix it on its own. I
> have used ntpdate successfully on the existing compute nodes, but
> here I got a "no server suitable for synchronization found" error,
> and 'ntpd -gqx' timed out. So to set the time, I had to point ntp at
> the default CentOS pool of ntp servers and then point it back to the
> head node. After that, 'ntpd -gqx' ran smoothly and I assume (based
> on the ntpq output) that it worked. Running 'date' on the new compute
> node and the existing head node simultaneously returns the same time
> to within ~1 second, rather than the 7:30 gap from the log file.
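>
> In case it helps anyone else, that sequence was roughly the following
> (commands approximate; HEADNODE is a placeholder):
>
>    systemctl stop ntpd
>    # temporarily list the public CentOS pool in /etc/ntp.conf, then step the clock:
>    ntpd -gqx
>    # switch /etc/ntp.conf back to 'server HEADNODE iburst' and restart:
>    systemctl start ntpd
>    ntpq -p     # reach should climb toward 377 once polls start succeeding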
>
> Not sure if it's relevant to this problem, but the new compute node
> is on a different subnet, connected to a different port than the
> existing compute nodes. This is the first time I've set up a node on
> a different subnet. I figured it would be simple to point slurm to
> the new node, but I didn't anticipate ntp and munge issues.
>
> Thanks,
>
> Gard
>
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Prentice Bisbal <pbisbal at pppl.gov>
> *Reply-To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Date:* Tuesday, October 27, 2020 at 12:22 PM
> *To:* "slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] [External] Munge thinks clocks aren't synced
>
> You don't specify what OS or version you're using. If you're using
> RHEL 7 or a derivative, chrony is used by default over ntpd, so
> there could be some confusion between chronyd and ntpd. If you
> haven't done so already, I'd check to see which daemon is actually
> running on your system.
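>
> For example, something like this (assuming systemd) will show which one
> is actually active and enabled:
>
>    systemctl is-active chronyd ntpd
>    systemctl is-enabled chronyd ntpd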
>
> Can you share the complete output of ntpq -p with us, and let us
> know what nodes the output is from? You might want to run
> 'ntpdate' before starting ntpd. If the clocks are too far off,
> either ntpd won't correct the time, or it will take a long time.
> ntpdate immediately syncs up the time between servers.
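>
> For example (HEADNODE_IP is a placeholder), with ntpd stopped you could run:
>
>    ntpdate HEADNODE_IP    # steps the clock immediately instead of slewing
>
> and then start ntpd again to keep it in sync.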
>
> I would make sure ntpdate is installed and enabled, then reboot both
> compute nodes. That ensures ntpdate is called at startup before ntpd,
> so both nodes come up with the correct time.
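>
> On CentOS 7 that would be roughly (package and unit names as shipped in
> the base repos):
>
>    yum install -y ntpdate
>    systemctl enable ntpdate ntpd
>    systemctl reboot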
>
> --
> Prentice
>
> On 10/27/20 2:08 PM, Gard Nelson wrote:
>
> Hi everyone,
>
> I’m adding a new node to an existing cluster. After installing
> slurm and the prereqs, I synced the clocks with ntpd. When I
> run ‘ntpq -p’, I get 0.0 for delay, offset and jitter. (the
> slurm head node is also the ntp server) ‘date’ also gives me
> identical times for the head and compute nodes. However, when
> I start slurmd, I get a munge error about the clocks being out
> of sync. From the slurmctld log:
>
> [2020-10-27T11:02:06.511] node NEW_NODE returned to service
>
> [2020-10-27T11:02:07.265] error: Munge decode failed: Rewound
> credential
>
> [2020-10-27T11:02:07.265] ENCODED: Tue Oct 27 11:09:45 2020
>
> [2020-10-27T11:02:07.265] DECODED: Tue Oct 27 11:02:07 2020
>
> [2020-10-27T11:02:07.265] error: Check for out of sync clocks
>
> [2020-10-27T11:02:07.265] error: slurm_unpack_received_msg:
> MESSAGE_NODE_REGISTRATION_STATUS has authentication error:
> Rewound credential
>
> [2020-10-27T11:02:07.265] error: slurm_unpack_received_msg:
> Protocol authentication error
>
> [2020-10-27T11:02:07.275] error: slurm_receive_msg
> [HEAD_NODE_IP:PORT]: Unspecified error
>
> I restarted ntp, munge and the slurm daemons on both nodes
> before this last error was generated. Any idea what’s going on
> here?
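>
> For reference, the restarts were roughly (commands approximate):
>
>    systemctl restart ntpd munge slurmd      # on the new compute node
>    systemctl restart munge slurmctld        # on the head node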
>
> Thanks,
>
> Gard
>
>
>
> --
>
> Prentice Bisbal
>
> Lead Software Engineer
>
> Research Computing
>
> Princeton Plasma Physics Laboratory
>
> http://www.pppl.gov
>
--
Prentice