[slurm-users] [External] Munge thinks clocks aren't synced

Barbara Krašovec barbara.krasovec at ijs.si
Wed Oct 28 06:40:48 UTC 2020


A "Rewound credential" error means the credential appears to have been
encoded more than TTL seconds in the future (the default munge TTL is
5 minutes), i.e. the clock on the decoding host is behind the clock on
the encoding host. You can try running munge with a different TTL
(munge -t) just to verify whether it is a time-sync issue. Also check
the timestamp on the munge.key.
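
For example (a minimal sketch, assuming working ssh between the nodes,
with NEW_NODE as a placeholder): encode a credential with a longer TTL
on one host and decode it on the other:

    munge -n -t 600 | ssh NEW_NODE unmunge

If unmunge succeeds with the longer TTL but fails with the default,
the clocks are skewed by somewhere between the two values.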

I don't think it's related to the new subnet.

Cheers,

Barbara

On 10/27/20 9:58 PM, Gard Nelson wrote:
>
> Thanks for your help, Prentice.
>
>  
>
> Sorry, yes – CentOS 7.5 installed on a fresh HDD. I rebooted and
> checked that chronyd is disabled and that ntpd is running. The rest
> of the cluster uses CentOS 7.5 with ntp, so sticking with ntpd is at
> least consistent, although maybe not ideal.
>
>  
>
> I’m running ntpq on the new compute node. It is pointed at the slurm
> head node, which is also set up as the ntp server. Here’s the output:
>
>  
>
> [root ~]# ntpq -p
>
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
>  HEADNODE_IP     .XFAC.          16 u    -  1024    0    0.000    0.000   0.000
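>
> (One thing worth noting in that output: st 16 with reach 0 normally
> means ntpd has not actually synchronized with that server yet. A
> quick way to double-check, assuming the stock ntp tools are
> installed:
>
>     ntpstat
>     ntpq -c rv
>
> ntpstat reports whether the system clock is synchronized, and the rv
> variables include the current offset.)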
>
>  
>
> It was a bit of a pain to get set up. The time difference was several
> hours, so ntp would have taken ages to fix it on its own. I have used
> ntpdate successfully on the existing compute nodes, but got a “no
> server suitable for synchronization found” error here, and ‘ntpd -gqx’
> timed out. So in order to set the time, I had to point ntp at the
> default CentOS pool of ntp servers first and then point it back to
> the headnode. After that, ‘ntpd -gqx’ ran smoothly, and I assume
> (based on the ntpq output) that it worked. Running ‘date’ on the new
> compute node and the existing head node simultaneously returns the
> same time to within ~1 sec, rather than the ~7:30 gap from the log
> file.
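>
> (Roughly the sequence that ended up working, as a sketch; this
> assumes the stock CentOS 7 ntp package, with HEADNODE_IP standing in
> for the real address:
>
>     systemctl stop ntpd
>     ntpdate 0.centos.pool.ntp.org   # step the clock against the public pool
>     # then switch /etc/ntp.conf back to: server HEADNODE_IP iburst
>     systemctl start ntpd
>
> )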
>
>  
>
> Not sure if it’s relevant to this problem, but the new compute node
> is on a different subnet, connected to a different port than the
> existing compute nodes. This is the first time I’ve set up a node on
> a different subnet. I figured it would be simple to point slurm to
> the new node, but I didn’t anticipate ntp and munge issues.
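>
> (In case the subnet does turn out to matter: two things I'd want to
> rule out, both assumptions on my part, are the head node's ntpd
> restrict rules and the firewall between the subnets. For example,
> with 10.0.2.0/24 as a placeholder for the new subnet:
>
>     # /etc/ntp.conf on the head node: allow the new subnet to sync
>     restrict 10.0.2.0 mask 255.255.255.0 nomodify notrap
>
>     # open NTP (UDP 123) on the head node (CentOS 7 / firewalld)
>     firewall-cmd --permanent --add-service=ntp && firewall-cmd --reload
>
> )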
>
>  
>
> Thanks,
>
> Gard
>
>  
>
>  
>
>  
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf
> of Prentice Bisbal <pbisbal at pppl.gov>
> Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Date: Tuesday, October 27, 2020 at 12:22 PM
> To: "slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] [External] Munge thinks clocks aren't synced
>
>  
>
> You don't specify what OS or version you're using. If you're using
> RHEL 7 or a derivative, chrony is used by default over ntpd, so there
> could be some confusion between chronyd and ntpd. If you haven't done
> so already, I'd check to see which daemon is actually running on your
> system.
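>
> A quick way to check which daemon is actually active (a sketch):
>
>     systemctl status chronyd ntpd
>
> Only one of the two should be running; chronyd should be stopped and
> disabled if ntpd is meant to own the clock.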
>
> Can you share the complete output of ntpq -p with us, and let us know
> what nodes the output is from? You might want to run 'ntpdate' before
> starting ntpd. If the clocks are too far off, either ntpd won't
> correct the time, or it will take a long time. ntpdate immediately
> syncs up the time between servers.
>
> I would make sure ntpdate is installed and enabled, then reboot both
> compute nodes. That way ntpdate is called at startup before ntpd, and
> both nodes come up with the correct time.
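>
> (On CentOS 7 the ntpdate package ships a one-shot ntpdate.service
> that steps the clock at boot from the servers listed in
> /etc/ntp/step-tickers; worth confirming on your systems. A sketch,
> with HEADNODE_IP as the placeholder:
>
>     echo 'HEADNODE_IP' >> /etc/ntp/step-tickers
>     systemctl enable ntpdate
>
> )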
>
> --
> Prentice
>
>  
>
> On 10/27/20 2:08 PM, Gard Nelson wrote:
>
>     Hi everyone,
>
>      
>
>     I’m adding a new node to an existing cluster. After installing
>     slurm and the prereqs, I synced the clocks with ntpd. When I run
>     ‘ntpq -p’, I get 0.0 for delay, offset and jitter. (the slurm head
>     node is also the ntp server) ‘date’ also gives me identical times
>     for the head and compute nodes. However, when I start slurmd, I
>     get a munge error about the clocks being out of sync. From the
>     slurmctld log:
>
>      
>
>     [2020-10-27T11:02:06.511] node NEW_NODE returned to service
>
>     [2020-10-27T11:02:07.265] error: Munge decode failed: Rewound
>     credential
>
>     [2020-10-27T11:02:07.265] ENCODED: Tue Oct 27 11:09:45 2020
>
>     [2020-10-27T11:02:07.265] DECODED: Tue Oct 27 11:02:07 2020
>
>     [2020-10-27T11:02:07.265] error: Check for out of sync clocks
>
>     [2020-10-27T11:02:07.265] error: slurm_unpack_received_msg:
>     MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Rewound
>     credential
>
>     [2020-10-27T11:02:07.265] error: slurm_unpack_received_msg:
>     Protocol authentication error
>
>     [2020-10-27T11:02:07.275] error: slurm_receive_msg
>     [HEAD_NODE_IP:PORT]: Unspecified error
>
>      
>
>     I restarted ntp, munge and the slurm daemons on both nodes before
>     this last error was generated. Any idea what’s going on here?
>
>      
>
>     Thanks,
>
>     Gard
>
>
> -- 
> Prentice Bisbal
> Lead Software Engineer
> Research Computing
> Princeton Plasma Physics Laboratory
http://www.pppl.gov