[slurm-users] Nodes are down after 2-3 minutes.

Mon May 7 13:57:34 MDT 2018

The two most likely causes of munge complaints:

1. Different keys in /etc/munge/munge.key
2. Clocks out of sync on the nodes in question
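
Both causes can be checked from the head node. The following is only a rough sketch: it assumes passwordless ssh as root (munge.key is normally readable only by the munge user and root), that it is run on the head node, and that sha256sum is available everywhere. The function names keys_differ, clocks_skewed and check_cluster are made up for illustration, and the 5-second tolerance is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch only: these helper names are hypothetical, not part of Slurm or MUNGE.

# Reads "host checksum" lines on stdin. The first line is the reference;
# prints every later host whose checksum differs. Non-zero exit on mismatch.
keys_differ() {
    awk 'NR == 1 { ref = $2; next }
         $2 != ref { print $1; bad = 1 }
         END { exit bad + 0 }'
}

# Reads "host offset_seconds" lines on stdin; prints hosts whose absolute
# offset exceeds the limit given as $1. Non-zero exit if any host is flagged.
clocks_skewed() {
    awk -v max="$1" '{ d = $2; if (d < 0) d = -d
                       if (d > max) { print $1, d; bad = 1 } }
         END { exit bad + 0 }'
}

# Assumption: run as root on the head node, with the compute nodes as arguments.
check_cluster() {
    local h
    {
        # Local key first, so it becomes the reference line.
        printf '%s %s\n' "$(hostname -s)" \
            "$(sha256sum /etc/munge/munge.key | cut -d' ' -f1)"
        for h in "$@"; do
            printf '%s %s\n' "$h" \
                "$(ssh -n "$h" sha256sum /etc/munge/munge.key | cut -d' ' -f1)"
        done
    } | keys_differ || echo "munge.key differs on the hosts listed above"

    # Offset = remote clock minus local clock, sampled back to back.
    for h in "$@"; do
        printf '%s %s\n' "$h" "$(( $(ssh -n "$h" date +%s) - $(date +%s) ))"
    done | clocks_skewed 5 || echo "clock offset exceeds 5 s on the hosts listed above"
}

# Example call: check_cluster radonc01 radonc02 radonc03 radonc04
```

Note that "munge -n | unmunge" run on a single machine only exercises that machine's own key, so it can report Success on every node even when the keys differ between nodes. The MUNGE documentation suggests a cross-host test instead, for example "munge -n | ssh radonc01 unmunge" from the head node.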

Andy

On 05/07/2018 03:50 PM, Eric F. Alemany wrote:
> Greetings,
>
> Reminder: I am new to SLURM.
>
> When I execute “sinfo”, my nodes are shown as down.
>
> sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*       up   infinite      4  down* radonc[01-04]
>
> This is what I have done so far, and nothing has helped. The nodes are
> in the “idle” state for 2-3 minutes and then they are “down” again.
>
> systemctl restart slurmd    on all nodes
>
> systemctl restart slurmctld  on master
>
> scontrol update node=radonc[01-04] state=UNDRAIN
>
> scontrol update node=radonc[01-04] state=IDLE
>
>
>
> I looked at the log file in /var/log/SlurmdLogFile.log  and saw some 
> “munge decode failed: Invalid credential”
>
> [2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: 
> MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid 
> credential
> [2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: Protocol 
> authentication error
> [2018-05-07T12:37:20.028] error: Munge decode failed: Invalid credential
> [2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: 
> MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid 
> credential
> [2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: Protocol 
> authentication error
> [2018-05-07T12:37:20.038] error: slurm_receive_msg 
> [10.112.0.14:42140]: Unspecified error
> [2018-05-07T12:37:20.038] error: slurm_receive_msg [10.112.0.5:34752]: 
> Unspecified error
> [2018-05-07T12:37:20.038] error: slurm_receive_msg [10.112.0.6:46746]: 
> Unspecified error
> [2018-05-07T12:37:20.039] error: slurm_receive_msg 
> [10.112.0.16:50788]: Unspecified error
>
>
> I ran the following command on all nodes (including master/headnode) 
> and got “Success”
>
>  munge -n | unmunge | grep STATUS
> STATUS:           Success (0)
>
>
> How can I fix this problem?
>
>
> Thank you in advance for all your help.
>
> Eric
>
>
> _____________________________________________________________________________________________________
>
> Eric F. Alemany
> System Administrator for Research
>
> Division of Radiation & Cancer  Biology
> Department of Radiation Oncology
>
> Stanford University School of Medicine
> Stanford, California 94305
>
> Tel: 1-650-498-7969  No Texting
> Fax: 1-650-723-7382
>
>
>

-- 
Andy Riebs
andy.riebs at hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
     May the source be with you!
