[slurm-users] Nodes are down after 2-3 minutes.
Eric F. Alemany
ealemany at stanford.edu
Mon May 7 14:26:58 MDT 2018
Thanks Paul.
_____________________________________________________________________________________________________
Eric F. Alemany
System Administrator for Research
Division of Radiation & Cancer Biology
Department of Radiation Oncology
Stanford University School of Medicine
Stanford, California 94305
Tel: 1-650-498-7969 (No Texting)
Fax: 1-650-723-7382
On May 7, 2018, at 1:07 PM, Paul Edmon <pedmon at cfa.harvard.edu> wrote:
Any command can be used to copy it. We deploy ours using Puppet.
-Paul Edmon-
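For example, a minimal sketch of pushing the key out from the head node, assuming root SSH access to the compute nodes (the host names below are taken from the sinfo output later in this thread and may need adjusting):

    # run as root on the master/head node
    for node in radonc01 radonc02 radonc03 radonc04; do
        scp -p /etc/munge/munge.key root@${node}:/etc/munge/munge.key
        ssh root@${node} 'chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key && systemctl restart munge'
    done

Once the key is in place and munge has been restarted, restart slurmd on each node so it re-registers with the controller.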
On 05/07/2018 04:04 PM, Eric F. Alemany wrote:
Thanks Andy.
I think I omitted a big step, which is copying /etc/munge/munge.key from the master/head node to /etc/munge/munge.key on all the nodes - am I right? I don't recall doing this, so that could be the problem.
Is there a specific command I need to run to copy the munge.key from the master/head node to all the nodes?
Thank you for your help and sorry for such “beginner” questions.
Best,
Eric
_____________________________________________________________________________________________________
Eric F. Alemany
System Administrator for Research
Division of Radiation & Cancer Biology
Department of Radiation Oncology
Stanford University School of Medicine
Stanford, California 94305
Tel: 1-650-498-7969 (No Texting)
Fax: 1-650-723-7382
On May 7, 2018, at 12:57 PM, Andy Riebs <andy.riebs at hpe.com> wrote:
The two most likely causes of munge complaints:
1. Different keys in /etc/munge/munge.key
2. Clocks out of sync on the nodes in question
Andy
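For reference, a quick way to check both points from the head node, assuming passwordless SSH to the compute nodes (host names are illustrative):

    # 1. key checksums must be identical on every node
    md5sum /etc/munge/munge.key
    for node in radonc01 radonc02 radonc03 radonc04; do ssh ${node} md5sum /etc/munge/munge.key; done

    # 2. clocks must agree closely; munge rejects credentials when the skew is too large
    date; for node in radonc01 radonc02 radonc03 radonc04; do ssh ${node} date; done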
On 05/07/2018 03:50 PM, Eric F. Alemany wrote:
Greetings,
Reminder: I am new to SLURM.
When I execute “sinfo”, my nodes are down.
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 4 down* radonc[01-04]
This is what I have done so far, and nothing has helped. The nodes are in the “idle” state for 2-3 minutes and then they are “down” again.
systemctl restart slurmd on all nodes
systemctl restart slurmctld on master
scontrol update node=radonc[01-04] state=UNDRAIN
scontrol update node=radonc[01-04] state=IDLE
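As an aside, once the underlying authentication problem is fixed, the usual way to return downed nodes to service is to resume them, e.g. (note the nodename= keyword):

    scontrol update nodename=radonc[01-04] state=resume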
I looked at the log file in /var/log/SlurmdLogFile.log and saw some “Munge decode failed: Invalid credential” errors:
[2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid credential
[2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: Protocol authentication error
[2018-05-07T12:37:20.028] error: Munge decode failed: Invalid credential
[2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid credential
[2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: Protocol authentication error
[2018-05-07T12:37:20.038] error: slurm_receive_msg [10.112.0.14:42140]: Unspecified error
[2018-05-07T12:37:20.038] error: slurm_receive_msg [10.112.0.5:34752]: Unspecified error
[2018-05-07T12:37:20.038] error: slurm_receive_msg [10.112.0.6:46746]: Unspecified error
[2018-05-07T12:37:20.039] error: slurm_receive_msg [10.112.0.16:50788]: Unspecified error
I ran the following command on all nodes (including the master/head node) and got “Success”:
munge -n | unmunge | grep STATUS
STATUS: Success (0)
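Note that running munge and unmunge on the same host only tests that host's local munged and key. A cross-node check, assuming SSH access between the head node and a compute node (host name is illustrative), would be:

    # encode on the head node, decode on a compute node (and vice versa)
    munge -n | ssh radonc01 unmunge

If the keys differ, the remote unmunge reports an error such as “Invalid credential”; a large clock skew typically shows up as “Rewound credential” or “Expired credential”.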
How can I fix this problem?
Thank you in advance for all your help.
Eric
_____________________________________________________________________________________________________
Eric F. Alemany
System Administrator for Research
Division of Radiation & Cancer Biology
Department of Radiation Oncology
Stanford University School of Medicine
Stanford, California 94305
Tel: 1-650-498-7969 (No Texting)
Fax: 1-650-723-7382
--
Andy Riebs
andy.riebs at hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!