[slurm-users] [External] Munge thinks clocks aren't synced

Prentice Bisbal pbisbal at pppl.gov
Thu Oct 29 18:56:05 UTC 2020


Having the head node run as an NTP server is a good idea. I set up my 
clusters the same way. Is it possible that ntp.conf on the head node has 
a restrict statement that limits access by IP address/range, which would 
explain why this one node on a different subnet can't reach it?
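
For example, if /etc/ntp.conf on the head node looks something like the 
sketch below (the subnets are just placeholders for your networks), only 
the first subnet would be allowed to query it, and the new node's subnet 
would need its own restrict line:

   # refuse everything by default
   restrict default nomodify notrap nopeer noquery
   # existing cluster subnet may query the server
   restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
   # a similar line would be needed for the new node's subnet
   restrict 192.168.2.0 mask 255.255.255.0 nomodify notrap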

It sounds like it's working now, but I don't understand why ntpdate 
would give you that error unless it couldn't reach ntpd on the head node.
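
A quick way to check that from the new node is to query the head node 
without touching the clock (HEADNODE_IP as in your ntpq output):

   # query only, don't set the clock
   ntpdate -q HEADNODE_IP
   # or ask the head node's ntpd for its peer list
   ntpq -p HEADNODE_IP

If either of those times out or reports "no server suitable for 
synchronization found", the node can't reach, or isn't allowed to query, 
ntpd on the head node.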

Prentice


On 10/27/20 4:58 PM, Gard Nelson wrote:
>
> Thanks for your help, Prentice.
>
> Sorry, yes: CentOS 7.5 installed on a fresh HDD. I rebooted and 
> checked that chronyd is disabled. ntpd is running. The rest of the 
> cluster uses CentOS 7.5 with ntpd, so sticking with ntp is possible, 
> although maybe not ideal.
>
> I’m running ntpq on the new compute node. It is pointed at the Slurm 
> head node, which is also set up as the NTP server. Here’s the output:
>
> [root ~]# ntpq -p
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
>  HEADNODE_IP     .XFAC.          16 u    - 1024    0    0.000    0.000   0.000
>
> It was a bit of a pain to get set up. The time difference was several 
> hours, so ntp would have taken ages to fix it on its own. I have used 
> ntpdate successfully on the existing compute nodes, but got a “no 
> server suitable for synchronization found” error here, and ‘ntpd -gqx’ 
> timed out. So in order to set the time, I had to point ntp at the 
> default CentOS pool of NTP servers and then point it back at the 
> headnode. After that, ‘ntpd -gqx’ ran smoothly and I assume (based on 
> the ntpq output) that it worked. Running ‘date’ on the new compute 
> node and the existing head node simultaneously returns the same time 
> to within ~1 sec rather than the ~7:30 gap from the log file.
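>
> Roughly what that looked like on the new node, from memory (HEADNODE_IP 
> is a placeholder, as above):
>
>     systemctl stop ntpd
>     # temporarily point /etc/ntp.conf at the public pool, e.g.
>     #   server 0.centos.pool.ntp.org iburst
>     ntpd -gqx            # one-shot sync; steps the clock from the pool
>     # then point /etc/ntp.conf back at the head node, e.g.
>     #   server HEADNODE_IP iburst
>     ntpd -gqx            # now completes quickly
>     systemctl start ntpd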
>
> Not sure if it’s relevant to this problem, but the new compute node is 
> on a different subnet, connected to a different port than the existing 
> compute nodes. This is the first time I’ve set up a node on a 
> different subnet. I figured it would be simple to point Slurm to the 
> new node, but I didn’t anticipate ntp and munge issues.
>
> Thanks,
>
> Gard
>
> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf 
> of Prentice Bisbal <pbisbal at pppl.gov>
> *Reply-To: *Slurm User Community List <slurm-users at lists.schedmd.com>
> *Date: *Tuesday, October 27, 2020 at 12:22 PM
> *To: *"slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
> *Subject: *Re: [slurm-users] [External] Munge thinks clocks aren't synced
>
> You don't specify what OS or version you're using. If you're using 
> RHEL 7 or a derivative, chrony is used by default over ntpd, so there 
> could be some confusion between chronyd and ntpd. If you haven't done 
> so already, I'd check to see which daemon is actually running on your 
> system.
>
> Can you share the complete output of ntpq -p with us, and let us know 
> what nodes the output is from? You might want to run 'ntpdate' before 
> starting ntpd. If the clocks are too far off, either ntpd won't 
> correct the time, or it will take a long time. ntpdate immediately 
> syncs up the time between servers.
>
> I would make sure ntpdate is installed and enabled, then reboot both 
> compute nodes. This will make sure ntpdate is called at startup before 
> ntpd, so both nodes come up with the correct time.
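>
> On CentOS 7 that would be something along these lines on the compute 
> node (a sketch; as far as I know the ntpdate service syncs once at boot 
> against the servers listed in /etc/ntp/step-tickers):
>
>     yum install ntp ntpdate
>     echo "HEADNODE_IP" >> /etc/ntp/step-tickers    # placeholder address
>     systemctl enable ntpd ntpdate
>     reboot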
>
> --
> Prentice
>
> On 10/27/20 2:08 PM, Gard Nelson wrote:
>
>     Hi everyone,
>
>     I’m adding a new node to an existing cluster. After installing
>     slurm and the prereqs, I synced the clocks with ntpd. When I run
>     ‘ntpq -p’, I get 0.0 for delay, offset and jitter. (the slurm head
>     node is also the ntp server) ‘date’ also gives me identical times
>     for the head and compute nodes. However, when I start slurmd, I
>     get a munge error about the clocks being out of sync. From the
>     slurmctld log:
>
>     [2020-10-27T11:02:06.511] node NEW_NODE returned to service
>     [2020-10-27T11:02:07.265] error: Munge decode failed: Rewound credential
>     [2020-10-27T11:02:07.265] ENCODED: Tue Oct 27 11:09:45 2020
>     [2020-10-27T11:02:07.265] DECODED: Tue Oct 27 11:02:07 2020
>     [2020-10-27T11:02:07.265] error: Check for out of sync clocks
>     [2020-10-27T11:02:07.265] error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Rewound credential
>     [2020-10-27T11:02:07.265] error: slurm_unpack_received_msg: Protocol authentication error
>     [2020-10-27T11:02:07.275] error: slurm_receive_msg [HEAD_NODE_IP:PORT]: Unspecified error
>
>     I restarted ntp, munge and the slurm daemons on both nodes before
>     this last error was generated. Any idea what’s going on here?
>
>     Thanks,
>
>     Gard
>
>
>
> -- 
> Prentice Bisbal
> Lead Software Engineer
> Research Computing
> Princeton Plasma Physics Laboratory
> http://www.pppl.gov

-- 
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov
