[slurm-users] Munge decode failing on new node

Sun Apr 19 15:29:48 UTC 2020

I see two things you should probably do:

 1. Run ntpd on your nodes. You can even have them sync with your master.
 2. Sync your user data on the nodes too, even if that just means ensuring
    /etc/passwd and /etc/group are the same on all of them.

While ntp is not required for slurm, time sync is very important, and
running ntp makes it a non-issue. Best practices and all.
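
For a quick check without ntp, something like the sketch below compares
epoch seconds between the controller and a node. The helper name
clock_skew and the host name node01 are made up for illustration. As far
as I know munge credentials default to a 300-second lifetime, so a skew
anywhere near that would be a problem.

```shell
# Absolute difference, in seconds, between two epoch timestamps.
clock_skew() {
    d=$(( $1 - $2 ))
    if [ "$d" -lt 0 ]; then
        d=$(( 0 - d ))
    fi
    echo "$d"
}

# Example (node01 is a placeholder host name; ssh latency adds a little error):
#   clock_skew "$(date +%s)" "$(ssh node01 date +%s)"
```

A result of a second or two is just ssh overhead; minutes or more means
the clocks really are out of sync.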

#2 is often overlooked, but obvious when you think about it.
I have seen folks add users by running 'useradd' on each node, but that
messes everything up if a package you installed on any node changed the
next uid there.
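
One way to spot that kind of drift is to reduce each node's passwd data
to name:uid:gid and compare. This is only a sketch: id_map and id_diff
are names I made up, and the file names in the usage note are
placeholders.

```shell
# Reduce passwd-format input on stdin to sorted "name:uid:gid" lines.
id_map() {
    awk -F: '{ print $1 ":" $3 ":" $4 }' | sort
}

# Print the entries that differ between two passwd-format files.
# Lines only in the first file are unindented; lines only in the
# second file are indented with a tab (that is how comm formats them).
id_diff() {
    a=$(mktemp) && b=$(mktemp) || return 1
    id_map < "$1" > "$a"
    id_map < "$2" > "$b"
    comm -3 "$a" "$b"
    rm -f "$a" "$b"
}
```

For example, save "getent passwd" output from a good node and from the
bad node into good.passwd and bad.passwd, then run
"id_diff good.passwd bad.passwd". No output means the two nodes agree.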

The error below looks like you may have a different uid for the slurm 
user on the node. What uid is slurmd running as on the bad node vs a 
good node?
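
To answer that across several nodes at once, a loop like the one below
may help. It is a sketch under assumptions: collect_uids and uids_agree
are hypothetical helper names, the node names in the usage note are
placeholders, and it assumes you can ssh to each node.

```shell
# Print "node uid" for the given user on each node.
# usage: collect_uids USER NODE...
collect_uids() {
    user=$1; shift
    for node in "$@"; do
        printf '%s %s\n' "$node" "$(ssh "$node" id -u "$user")"
    done
}

# Read "node uid" lines on stdin; succeed only if every line
# reports the same uid.
uids_agree() {
    awk '{ seen[$2] = 1 }
         END { n = 0; for (u in seen) n++; exit (n == 1 ? 0 : 1) }'
}
```

Something like "collect_uids slurm goodnode badnode | uids_agree ||
echo mismatch" would flag a difference; the same check is worth running
for the munge user.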

Brian Andrus

On 4/17/2020 2:38 PM, Dean Schulze wrote:
> Just noticed this.  On the problem node the munged.log file has an 
> entry every 1:40:
>
> 2020-04-17 15:31:02 -0600 Info:      Invalid credential
> 2020-04-17 15:32:42 -0600 Info:      Invalid credential
> 2020-04-17 15:34:22 -0600 Info:      Invalid credential
>
> This happens on the failed node and two other nodes that work.  Two 
> nodes that work (including the controller) don't have this message.
>
>
>
> On Fri, Apr 17, 2020 at 2:00 PM Riebs, Andy <andy.riebs at hpe.com>
> wrote:
>
>     A couple of quick checks to see if the problem is munge:
>
>     1. On the problem node, try
>     $ echo foo | munge | unmunge
>
>     2. If (1) works, try this from the node running slurmctld to the
>     problem node
>     slurm-node$ echo foo | ssh node munge | unmunge
>
>     From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com]
>     On Behalf Of Dean Schulze
>     Sent: Friday, April 17, 2020 3:40 PM
>     To: Slurm User Community List <slurm-users at lists.schedmd.com>
>     Subject: Re: [slurm-users] Munge decode failing on new node
>
>     There is no ntp service running on any of my nodes, and all but
>     this one are working.  I haven't heard that ntp is a requirement
>     for slurm, just that the time be synchronized across the cluster.
>     And it is.
>
>     On Wed, Apr 15, 2020 at 12:17 PM Carlos Fenoy <minibit at gmail.com>
>     wrote:
>
>         I’d check ntp as your encoding time seems odd to me
>
>         On Wed, 15 Apr 2020 at 19:59, Dean Schulze
>         <dean.w.schulze at gmail.com> wrote:
>
>             I've installed two new nodes onto my slurm cluster.  One
>             node works, but the other one complains about an invalid
>             credential for munge.  I've verified that the munge.key is
>             the same as on all other nodes with
>
>
>             sudo cksum /etc/munge/munge.key
>
>             I recopied a munge.key from a node that works.  I've
>             verified that munge uid and gid are the same on the
>             nodes.  The time is in sync on all nodes.
>
>             Here is what is in the slurmd.log:
>
>              error: Unable to register: Unable to contact slurm controller (connect failure)
>              error: Munge decode failed: Invalid credential
>              ENCODED: Wed Dec 31 17:00:00 1969
>              DECODED: Wed Dec 31 17:00:00 1969
>              error: authentication: Invalid authentication credential
>              error: slurm_receive_msg_and_forward: Protocol authentication error
>              error: service_connection: slurm_receive_msg: Protocol authentication error
>              error: Unable to register: Unable to contact slurm controller (connect failure)
>
>             I've checked in the munged.log and all it says is
>
>             Invalid credential
>
>             Thanks for your help
>
>         -- 
>
>         --
>         Carles Fenoy
>