[slurm-users] [External] Munge thinks clocks aren't synced

Prentice Bisbal pbisbal at pppl.gov
Tue Oct 27 19:21:44 UTC 2020


You don't specify what OS or version you're using. If you're using RHEL 
7 or a derivative, chrony is used by default over ntpd, so there could 
be some confusion between chronyd and ntpd. If you haven't done so 
already, I'd check to see which daemon is actually running on your system.

Can you share the complete output of ntpq -p with us, and let us know 
which nodes the output is from? You might want to run 'ntpdate' before 
starting ntpd. If the clocks are too far off, ntpd will either refuse to 
correct the time or take a long time to do so. ntpdate syncs the clock 
to the server immediately.
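
To get a feel for how far apart the clocks were when the credential was 
checked, you can subtract the DECODED time from the ENCODED time in the 
slurmctld log quoted below. Here is a minimal Python sketch of that 
arithmetic. The helper name credential_skew_seconds is made up for 
illustration, and the timestamp format is simply copied from those two 
log lines:

```python
from datetime import datetime

# Timestamp layout as it appears on the ENCODED/DECODED lines of the
# quoted slurmctld log, e.g. "Tue Oct 27 11:09:45 2020".
MUNGE_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def credential_skew_seconds(encoded: str, decoded: str) -> float:
    """Return (encoded - decoded) in seconds.

    Hypothetical helper, not part of Slurm or MUNGE. A positive result
    means the encoding host's clock was ahead of the decoding host's.
    """
    enc = datetime.strptime(encoded, MUNGE_TIME_FORMAT)
    dec = datetime.strptime(decoded, MUNGE_TIME_FORMAT)
    return (enc - dec).total_seconds()


# Values taken from the log in the original message.
skew = credential_skew_seconds(
    "Tue Oct 27 11:09:45 2020",
    "Tue Oct 27 11:02:07 2020",
)
print(skew)  # 458.0, i.e. 7 minutes 38 seconds
```

If that reading is right, the host that encoded the credential was 
running roughly 7.5 minutes ahead of the host that decoded it at that 
moment, even though ntpq -p reported an offset of 0.0.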

I would make sure ntpdate is installed and enabled, then reboot both 
compute nodes. That way ntpdate is called at startup before ntpd, so 
every node starts out with the correct time.

--
Prentice


On 10/27/20 2:08 PM, Gard Nelson wrote:
>
> Hi everyone,
>
> I’m adding a new node to an existing cluster. After installing slurm 
> and the prereqs, I synced the clocks with ntpd. When I run ‘ntpq -p’, 
> I get 0.0 for delay, offset and jitter. (the slurm head node is also 
> the ntp server) ‘date’ also gives me identical times for the head and 
> compute nodes. However, when I start slurmd, I get a munge error about 
> the clocks being out of sync. From the slurmctld log:
>
> [2020-10-27T11:02:06.511] node NEW_NODE returned to service
> [2020-10-27T11:02:07.265] error: Munge decode failed: Rewound credential
> [2020-10-27T11:02:07.265] ENCODED: Tue Oct 27 11:09:45 2020
> [2020-10-27T11:02:07.265] DECODED: Tue Oct 27 11:02:07 2020
> [2020-10-27T11:02:07.265] error: Check for out of sync clocks
> [2020-10-27T11:02:07.265] error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Rewound credential
> [2020-10-27T11:02:07.265] error: slurm_unpack_received_msg: Protocol authentication error
> [2020-10-27T11:02:07.275] error: slurm_receive_msg [HEAD_NODE_IP:PORT]: Unspecified error
>
> I restarted ntp, munge and the slurm daemons on both nodes before this 
> last error was generated. Any idea what’s going on here?
>
> Thanks,
>
> Gard
>
-- 
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov
