[slurm-users] One node won't connect and false positive messages from slurm every 1 minute 40 seconds

Dean Schulze dean.w.schulze at gmail.com
Wed Apr 22 20:24:28 UTC 2020


I added two new nodes to my cluster (5 nodes total including controller).
One of the new nodes works, but the other one can't connect to the
controller.  Both new nodes were created the same way except that the one
that can't connect to the controller has some extra packages installed to
build slurm.  Initially I thought this was a problem with munge, but now it
looks like it is something else.  Here's what I'm seeing in the logs.

The new node that won't connect logs a munge "Invalid credential" error in
/var/log/slurm/slurmd.log every 1 minute 40 seconds.  The
/var/log/munge/munged.log file shows a corresponding "Invalid credential"
entry at the same interval.

fabricnode1 (won't connect to controller)
/var/log/slurm/slurmd.log
[2020-04-22T12:54:54.154] error: Munge decode failed: Invalid credential
[2020-04-22T12:54:54.154] ENCODED: Wed Dec 31 17:00:00 1969
[2020-04-22T12:54:54.154] DECODED: Wed Dec 31 17:00:00 1969
[2020-04-22T12:54:54.154] error: authentication: Invalid authentication credential
[2020-04-22T12:54:54.154] error: slurm_receive_msg_and_forward: Protocol authentication error
[2020-04-22T12:54:54.165] error: service_connection: slurm_receive_msg: Protocol authentication error
[2020-04-22T12:55:18.694] error: Unable to register: Unable to contact slurm controller (connect failure)
[2020-04-22T12:55:48.716] error: Unable to register: Unable to contact slurm controller (connect failure)
[2020-04-22T12:56:18.737] error: Unable to register: Unable to contact slurm controller (connect failure)
[2020-04-22T12:56:34.710] error: Munge decode failed: Invalid credential
[2020-04-22T12:56:34.710] ENCODED: Wed Dec 31 17:00:00 1969
[2020-04-22T12:56:34.710] DECODED: Wed Dec 31 17:00:00 1969
[2020-04-22T12:56:34.710] error: authentication: Invalid authentication credential
[2020-04-22T12:56:34.710] error: slurm_receive_msg_and_forward: Protocol authentication error
[2020-04-22T12:56:34.720] error: service_connection: slurm_receive_msg: Protocol authentication error

/var/log/munge/munged.log
2020-04-22 12:56:34 -0600 Info:      Invalid credential
2020-04-22 12:58:14 -0600 Info:      Invalid credential


So that points to munge, but the new node that does connect shows the same
munge entries in its logs.  The only difference is that the three "Unable
to register: Unable to contact slurm controller (connect failure)" lines
aren't there:

fabricnode2 (connects)
/var/log/slurm/slurmd.log
[2020-04-22T12:51:34.899] error: Munge decode failed: Invalid credential
[2020-04-22T12:51:34.899] ENCODED: Wed Dec 31 17:00:00 1969
[2020-04-22T12:51:34.899] DECODED: Wed Dec 31 17:00:00 1969
[2020-04-22T12:51:34.899] error: authentication: Invalid authentication credential
[2020-04-22T12:51:34.899] error: slurm_receive_msg_and_forward: Protocol authentication error
[2020-04-22T12:51:34.909] error: service_connection: slurm_receive_msg: Protocol authentication error
[2020-04-22T12:53:14.482] error: Munge decode failed: Invalid credential
[2020-04-22T12:53:14.482] ENCODED: Wed Dec 31 17:00:00 1969
[2020-04-22T12:53:14.482] DECODED: Wed Dec 31 17:00:00 1969
[2020-04-22T12:53:14.482] error: authentication: Invalid authentication credential
[2020-04-22T12:53:14.482] error: slurm_receive_msg_and_forward: Protocol authentication error
[2020-04-22T12:53:14.492] error: service_connection: slurm_receive_msg: Protocol authentication error

/var/log/munge/munged.log
2020-04-22 12:51:34 -0600 Info:      Invalid credential
2020-04-22 12:53:14 -0600 Info:      Invalid credential


Also, one of the existing nodes that has always connected to the controller
shows the same munge entries in the log files:

slurmnode1 (connects)
/var/log/slurm/slurmd.log
[2020-04-22T12:58:14.541] error: Munge decode failed: Invalid credential
[2020-04-22T12:58:14.541] ENCODED: Wed Dec 31 17:00:00 1969
[2020-04-22T12:58:14.541] DECODED: Wed Dec 31 17:00:00 1969
[2020-04-22T12:58:14.541] error: authentication: Invalid authentication credential
[2020-04-22T12:58:14.541] error: slurm_receive_msg_and_forward: Protocol authentication error
[2020-04-22T12:58:14.551] error: service_connection: slurm_receive_msg: Protocol authentication error
[2020-04-22T12:59:54.155] error: Munge decode failed: Invalid credential
[2020-04-22T12:59:54.155] ENCODED: Wed Dec 31 17:00:00 1969
[2020-04-22T12:59:54.155] DECODED: Wed Dec 31 17:00:00 1969
[2020-04-22T12:59:54.155] error: authentication: Invalid authentication credential
[2020-04-22T12:59:54.155] error: slurm_receive_msg_and_forward: Protocol authentication error
[2020-04-22T12:59:54.166] error: service_connection: slurm_receive_msg: Protocol authentication error

/var/log/munge/munged.log
2020-04-22 12:58:14 -0600 Info:      Invalid credential
2020-04-22 12:59:54 -0600 Info:      Invalid credential


The other existing node that has always connected doesn't have any periodic
entries in the logs while idle.

Since two nodes that do connect to the controller show the same munge
entries in their logs as the one node that won't connect, it looks like
munge is a red herring.  Munge appears to log false positive messages every
1 minute 40 seconds.
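
To make the interval concrete, here is a small sketch that computes the gaps between the "Munge decode failed" timestamps quoted above. The timestamps are copied from the slurmd.log excerpts in this message; the names `STAMPS` and `gaps_seconds` are only illustrative.

```python
from datetime import datetime

# Timestamps of the "Munge decode failed" lines, copied from the
# slurmd.log excerpts in this message (one list per node).
STAMPS = {
    "fabricnode1": ["2020-04-22T12:54:54.154", "2020-04-22T12:56:34.710"],
    "fabricnode2": ["2020-04-22T12:51:34.899", "2020-04-22T12:53:14.482"],
    "slurmnode1": ["2020-04-22T12:58:14.541", "2020-04-22T12:59:54.155"],
}


def gaps_seconds(stamps):
    """Return the gaps, in seconds, between consecutive sorted timestamps."""
    times = sorted(datetime.strptime(s, "%Y-%m-%dT%H:%M:%S.%f") for s in stamps)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


# Gap within each node's own log excerpt.
per_node = {node: gaps_seconds(stamps) for node, stamps in STAMPS.items()}

# Gaps when the entries from all three nodes are merged into one timeline.
merged = gaps_seconds([s for stamps in STAMPS.values() for s in stamps])

for node, gaps in per_node.items():
    print(node, [round(g, 1) for g in gaps])
print("all nodes merged:", [round(g, 1) for g in merged])
```

Every gap, both per node and in the merged timeline, comes out within about a second of 100 seconds. The merged result suggests the entries on the three nodes may sit on one shared 100-second schedule, although the excerpts are too short to be sure of that.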

Does a log entry that repeats every 1 minute 40 seconds ring a bell with
anyone?  Any other ideas?
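
Since the "connect failure" lines are the only thing unique to fabricnode1, one check that would separate a network problem from an authentication problem is whether the node can open a TCP connection to slurmctld at all. This is a minimal sketch, not a definitive test: it assumes the controller listens on Slurm's default SlurmctldPort of 6817, and the hostname `controller` in the usage comment is a placeholder. Both should be taken from slurm.conf.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Hypothetical usage, run on the node that won't register:
#   print(can_reach("controller", 6817))
```

A False result from fabricnode1 alongside True from fabricnode2 would point toward name resolution, routing, or a firewall rather than munge.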