[slurm-users] Munge decode failing on new node
Brian Andrus
toomuchit at gmail.com
Sun Apr 19 15:29:48 UTC 2020
I see two things you should probably do:
1. Run ntpd on your nodes. You can even have them sync with your master.
2. Sync your user data on the nodes too, even if that just means ensuring
/etc/passwd and /etc/group are identical on all of them.
While ntp is not required for slurm, the time sync is very important and
ntp makes that a non-issue. Best practices and all.
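If you want to put a number on the clock difference before setting up ntpd, something along these lines should do. It is only a sketch: skew_seconds is a helper name I made up, and the remote timestamp has to be fetched separately, for example with ssh and date +%s against a node name of your own. Munge credentials carry their encode time and a time-to-live, so a large skew between nodes can make a perfectly good credential look invalid.

```shell
#!/bin/sh
# skew_seconds A B: print the absolute difference between two epoch
# timestamps (seconds since 1970-01-01 UTC). Hypothetical helper name.
skew_seconds() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# Compare this machine's clock with a timestamp passed as the first
# argument. With no argument the local time is compared with itself.
local_now=$(date +%s)
remote_now=${1:-$local_now}
echo "skew: $(skew_seconds "$local_now" "$remote_now")s"
```

You would call the script with the output of date +%s taken on the other node, ideally captured at nearly the same moment, since the ssh round trip adds to the measured difference.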
#2 is something that is often overlooked, but obvious when you think
about it.
I have seen folks add users by running 'useradd' on each node, but that
messes everything up if you installed a package or the like that changed
the next available uid on any node.
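A quick way to spot that kind of drift is to reduce two passwd files to name:uid pairs and see which names disagree. The sketch below assumes you have already copied the other node's /etc/passwd somewhere local; uid_map and uid_mismatches are names I invented for the example.

```shell
#!/bin/sh
# uid_map FILE: reduce a passwd-format file to sorted "name:uid" lines.
uid_map() {
    awk -F: '{ print $1 ":" $3 }' "$1" | sort
}

# uid_mismatches A B: print the names whose uid differs between two
# passwd-format files, or that exist in only one of them.
uid_mismatches() {
    a=$(mktemp) && b=$(mktemp) || return 1
    uid_map "$1" > "$a"
    uid_map "$2" > "$b"
    # comm -3 drops lines common to both files; entries unique to the
    # second file are tab-indented, so strip the tabs before cutting.
    comm -3 "$a" "$b" | tr -d '\t' | cut -d: -f1 | sort -u
    rm -f "$a" "$b"
}
```

For example, uid_mismatches /etc/passwd /tmp/passwd.othernode (the second path is a placeholder) prints nothing when the two files agree on every uid. The same approach works for /etc/group, since the gid is also the third field there.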
The error below looks like you may have a different uid for the slurm
user on the node. What uid is slurmd running as on the bad node vs a
good node?
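To answer that, you could run something like this on both the bad node and a good one and compare the output. It is a rough sketch: the account names slurm and munge are assumptions (use whatever SlurmUser is set to in your slurm.conf), and show_ids is just a name I picked.

```shell
#!/bin/sh
# show_ids USER...: print "user:uid:gid" for each account on this
# machine, or "user:missing:missing" if the account does not exist.
show_ids() {
    for user in "$@"; do
        if id "$user" >/dev/null 2>&1; then
            printf '%s:%s:%s\n' "$user" "$(id -u "$user")" "$(id -g "$user")"
        else
            printf '%s:missing:missing\n' "$user"
        fi
    done
}

# Account names are assumptions; adjust them to your site.
show_ids slurm munge
```

If the lines differ between nodes, that is your mismatch. To see which uid the running daemon actually has, ps -o uid=,user= -C slurmd works with the procps ps found on most Linux distributions.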
Brian Andrus
On 4/17/2020 2:38 PM, Dean Schulze wrote:
> Just noticed this. On the problem node the munged.log file has an
> entry every 1:40:
>
> 2020-04-17 15:31:02 -0600 Info: Invalid credential
> 2020-04-17 15:32:42 -0600 Info: Invalid credential
> 2020-04-17 15:34:22 -0600 Info: Invalid credential
>
> This happens on the failed node and two other nodes that work. Two
> nodes that work (including the controller) don't have this message.
>
>
>
> On Fri, Apr 17, 2020 at 2:00 PM Riebs, Andy <andy.riebs at hpe.com> wrote:
>
> A couple of quick checks to see if the problem is munge:
>
> 1. On the problem node, try
> $ echo foo | munge | unmunge
>
> 2. If (1) works, try this from the node running slurmctld to the
> problem node
> slurm-node$ echo foo | ssh node munge | unmunge
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Dean Schulze
> Sent: Friday, April 17, 2020 3:40 PM
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] Munge decode failing on new node
>
> There is no ntp service running on any of my nodes, and all but
> this one is working. I haven't heard that ntp is a requirement
> for slurm, just that the time be synchronized across the cluster.
> And it is.
>
> On Wed, Apr 15, 2020 at 12:17 PM Carlos Fenoy <minibit at gmail.com> wrote:
>
> I’d check ntp as your encoding time seems odd to me
>
> On Wed, 15 Apr 2020 at 19:59, Dean Schulze <dean.w.schulze at gmail.com> wrote:
>
> I've installed two new nodes onto my slurm cluster. One
> node works, but the other one complains about an invalid
> credential for munge. I've verified that the munge.key is
> the same as on all other nodes with
>
>
> sudo cksum /etc/munge/munge.key
>
> I recopied a munge.key from a node that works. I've
> verified that munge uid and gid are the same on the
> nodes. The time is in sync on all nodes.
>
> Here is what is in the slurmd.log:
>
> error: Unable to register: Unable to contact slurm
> controller (connect failure)
> error: Munge decode failed: Invalid credential
> ENCODED: Wed Dec 31 17:00:00 1969
> DECODED: Wed Dec 31 17:00:00 1969
> error: authentication: Invalid authentication credential
> error: slurm_receive_msg_and_forward: Protocol
> authentication error
> error: service_connection: slurm_receive_msg: Protocol
> authentication error
> error: Unable to register: Unable to contact slurm
> controller (connect failure)
>
> I've checked in the munged.log and all it says is
>
> Invalid credential
>
> Thanks for your help
>
> --
>
> --
> Carles Fenoy
>