[slurm-users] [External] Munge thinks clocks aren't synced

Prentice Bisbal pbisbal at pppl.gov
Thu Oct 29 18:58:19 UTC 2020


Good catch. I didn't even notice that. I definitely think the ntp.conf 
file on the head node is restricting access by IP range.
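
To illustrate what I mean by a range restriction: the 'restrict' lines in 
the head node's ntp config take an address plus a mask, and if the default 
policy ignores everyone else, a client outside the listed ranges gets no 
reply. Here is a rough Python sketch of that membership check. The subnets 
and addresses are made up (the real ones are not in this thread), and this 
is only an illustration of the idea, not how ntpd evaluates its rules 
internally.

```python
import ipaddress

# Hypothetical range that the head node's restrict lines would permit.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.168.1.0/24"),
]


def is_allowed(client_ip, networks=ALLOWED_NETWORKS):
    """Return True if client_ip falls inside any permitted network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)


if __name__ == "__main__":
    # Existing compute node on the original subnet (made-up address).
    print(is_allowed("192.168.1.20"))
    # New compute node on a different subnet (made-up address).
    print(is_allowed("192.168.2.20"))
```

If the new node's address comes back as not allowed for the ranges in the 
real config, that would explain why it never gets an answer.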

Prentice

On 10/28/20 3:04 AM, Williams, Gareth (IM&T, Black Mountain) wrote:
>
> I’m pretty sure that ntp info indicates ntp is not working. reach=0 so 
> no successful connections in many cycles.
>
> https://www.linuxjournal.com/article/6812
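
For anyone reading this later: the reach column is an 8-bit shift 
register printed in octal, with one bit per poll and the lowest bit being 
the most recent poll. So 377 means the last eight polls all got a reply, 
and 0 means none did. A small Python sketch for decoding it (the function 
names are my own, nothing here comes from the ntp tools):

```python
def decode_reach(reach_field):
    """Decode ntpq's octal 'reach' column into a list of 8 booleans.

    Index 0 is the most recent poll, index 7 the oldest.
    """
    value = int(str(reach_field), 8) & 0xFF
    return [bool((value >> bit) & 1) for bit in range(8)]


def successful_polls(reach_field):
    """Count how many of the last eight polls got a reply."""
    return sum(decode_reach(reach_field))


if __name__ == "__main__":
    for field in ("0", "17", "377"):
        print(field, successful_polls(field))
```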
>
> Gareth
>
> *From:*slurm-users <slurm-users-bounces at lists.schedmd.com> *On Behalf 
> Of *Barbara Krašovec
> *Sent:* Wednesday, 28 October 2020 5:41 PM
> *To:* slurm-users at lists.schedmd.com
> *Subject:* Re: [slurm-users] [External] Munge thinks clocks aren't synced
>
> A "Rewound credential" error means that the credential appears to have 
> been encoded more than TTL seconds in the future (the default munge TTL 
> is 5 minutes). In other words, the clock on the decoding host is behind 
> the clock on the encoding host. You can try running munge with a 
> different TTL (munge -t) just to verify whether it is a time sync 
> issue. Also check the timestamp on the munge.key.
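
As a quick sanity check, the ENCODED and DECODED timestamps from the 
slurmctld log quoted further down can be compared against that TTL. The 
sketch below is my simplified reading of the check, not munge's actual 
code, and the function names are hypothetical. With the values from the 
log, the encode time is 458 seconds (7 min 38 s) ahead of the decode time, 
which is over the 300 second default.

```python
from datetime import datetime

MUNGE_DEFAULT_TTL = 300  # seconds (5 minutes), the default mentioned above
LOG_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Tue Oct 27 11:09:45 2020"


def encode_lead_seconds(encoded, decoded):
    """Seconds by which the ENCODED time is ahead of the DECODED time."""
    enc = datetime.strptime(encoded, LOG_TIME_FORMAT)
    dec = datetime.strptime(decoded, LOG_TIME_FORMAT)
    return (enc - dec).total_seconds()


def looks_rewound(encoded, decoded, ttl=MUNGE_DEFAULT_TTL):
    """True if the credential appears encoded more than ttl seconds ahead."""
    return encode_lead_seconds(encoded, decoded) > ttl


if __name__ == "__main__":
    enc = "Tue Oct 27 11:09:45 2020"
    dec = "Tue Oct 27 11:02:07 2020"
    print(encode_lead_seconds(enc, dec), looks_rewound(enc, dec))
```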
>
> I don't think it's related to the new subnet.
>
> Cheers,
>
> Barbara
>
> On 10/27/20 9:58 PM, Gard Nelson wrote:
>
>     Thanks for your help, Prentice.
>
>     Sorry, yes – centos 7.5 installed on a fresh HDD. I rebooted and
>     checked that chronyd is disabled. ntpd is running. The rest of the
>     cluster uses centos 7.5 and ntp so it’s possible, although maybe
>     not ideal.
>
>     I’m running ntpq on the new compute node. It is looking to the
>     slurm head node which is also set up as the ntp server. Here’s the
>     output:
>
>     [root ~]# ntpq -p
>          remote           refid      st t when poll reach   delay   offset  jitter
>     ==============================================================================
>      HEADNODE_IP     .XFAC.          16 u    - 1024    0    0.000    0.000   0.000
>
>     It was a bit of a pain to get set up. The time difference was
>     several hours so ntp would have taken ages to fix on its own. I
>     have used ntpdate successfully on the existing compute nodes, but
>     got a “no server suitable for synchronization found” error here.
>     ‘ntpd -gqx’ timed out. So in order to set the time, I had to point
>     ntp to the default centos pool of ntp servers to set the time and
>     then point it back to the headnode. After that, ‘ntpd -gqx’ ran
>     smoothly and I assume (based on the ntpq output) that it worked.
>     Running ‘date’ on the new compute and existing head node
>     simultaneously returns the same time to within ~1 sec rather than
>     the 7:30 gap from the log file.
>
>     Not sure if it’s relevant to this problem, but the new compute
>     node is on a different subnet connected to a different port than
>     the existing compute nodes. This is the first time that I’ve set
>     up a node on a different subnet. I figured it would be simple to point
>     slurm to the new node, but I didn’t anticipate ntp and munge issues.
>
>     Thanks,
>
>     Gard
>
>     *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
>     Prentice Bisbal <pbisbal at pppl.gov>
>     *Reply-To: *Slurm User Community List <slurm-users at lists.schedmd.com>
>     *Date: *Tuesday, October 27, 2020 at 12:22 PM
>     *To: *"slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
>     *Subject: *Re: [slurm-users] [External] Munge thinks clocks aren't synced
>
>     You don't specify what OS or version you're using. If you're using
>     RHEL 7 or a derivative, chrony is used by default over ntpd, so
>     there could be some confusion between chronyd and ntpd. If you
>     haven't done so already, I'd check to see which daemon is actually
>     running on your system.
>
>     Can you share the complete output of ntpq -p with us, and let us
>     know what nodes the output is from? You might want to run
>     'ntpdate' before starting ntpd. If the clocks are too far off,
>     either ntpd won't correct the time, or it will take a long time.
>     ntpdate immediately syncs up the time between servers.
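
To put a rough number on "a long time": ntpd slews the clock at no more 
than 500 ppm, and by default it gives up if the offset is larger than its 
1000 second panic threshold (unless it is started with -g). The Python 
sketch below just does that arithmetic for a correction done purely by 
slewing. The 3 hour offset is an illustrative value, not a measurement 
from this cluster, and the function names are my own.

```python
MAX_SLEW_RATE = 500e-6    # seconds of correction per second (500 ppm)
PANIC_THRESHOLD = 1000.0  # seconds; ntpd's default, bypassed once by -g


def slew_duration_seconds(offset_seconds, rate=MAX_SLEW_RATE):
    """Time needed to remove an offset by slewing only."""
    return abs(offset_seconds) / rate


def exceeds_panic_threshold(offset_seconds, threshold=PANIC_THRESHOLD):
    """True if ntpd would refuse to correct this offset without -g."""
    return abs(offset_seconds) > threshold


if __name__ == "__main__":
    offset = 3 * 3600  # illustrative 3 hour offset, in seconds
    days = slew_duration_seconds(offset) / 86400
    print(round(days), exceeds_panic_threshold(offset))
```

For a 3 hour offset that works out to roughly 250 days of slewing, which 
is why stepping the clock once with ntpdate first is the practical option.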
>
>     I would make sure ntpdate is installed and enabled, then reboot
>     both compute nodes. That way ntpdate is called at startup before
>     ntpd, so every node starts out with the correct time.
>
>     --
>     Prentice
>
>     On 10/27/20 2:08 PM, Gard Nelson wrote:
>
>         Hi everyone,
>
>         I’m adding a new node to an existing cluster. After installing
>         slurm and the prereqs, I synced the clocks with ntpd. When I
>         run ‘ntpq -p’, I get 0.0 for delay, offset and jitter. (the
>         slurm head node is also the ntp server) ‘date’ also gives me
>         identical times for the head and compute nodes. However, when
>         I start slurmd, I get a munge error about the clocks being out
>         of sync. From the slurmctld log:
>
>         [2020-10-27T11:02:06.511] node NEW_NODE returned to service
>         [2020-10-27T11:02:07.265] error: Munge decode failed: Rewound credential
>         [2020-10-27T11:02:07.265] ENCODED: Tue Oct 27 11:09:45 2020
>         [2020-10-27T11:02:07.265] DECODED: Tue Oct 27 11:02:07 2020
>         [2020-10-27T11:02:07.265] error: Check for out of sync clocks
>         [2020-10-27T11:02:07.265] error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Rewound credential
>         [2020-10-27T11:02:07.265] error: slurm_unpack_received_msg: Protocol authentication error
>         [2020-10-27T11:02:07.275] error: slurm_receive_msg [HEAD_NODE_IP:PORT]: Unspecified error
>
>         I restarted ntp, munge and the slurm daemons on both nodes
>         before this last error was generated. Any idea what’s going on
>         here?
>
>         Thanks,
>
>         Gard
>
>
>     -- 
>
>     Prentice Bisbal
>
>     Lead Software Engineer
>
>     Research Computing
>
>     Princeton Plasma Physics Laboratory
>
>     http://www.pppl.gov
>
-- 
Prentice
