[slurm-users] Nodes are down after 2-3 minutes.

Eric F. Alemany ealemany at stanford.edu
Mon May 7 16:21:46 MDT 2018


Sorry to report that I still have the same problem.

I copied /etc/munge/munge.key from the master to all the nodes.
Checked the date on master and nodes - OK

systemctl restart slurmctld  on master

systemctl restart slurmd on all nodes
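
One step that is easy to miss here: munged only reads /etc/munge/munge.key at startup, so after copying a new key the munge daemon itself must be restarted on every host, and the key must be owned by the munge user and readable by no one else. A minimal sketch, assuming root SSH access from the master (the radonc01-04 names are expanded from the sinfo output further down the thread):

  # restart munged so the freshly copied key is actually loaded
  systemctl restart munge                      # on the master
  for n in radonc01 radonc02 radonc03 radonc04; do
      ssh "$n" systemctl restart munge
      ssh "$n" ls -l /etc/munge/munge.key      # expect: -r-------- munge munge
  done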


Checked /var/log/slurm-llnl/SlurmdLogFile.log again:

[2018-05-07T15:19:30.936] error: slurm_unpack_received_msg: Protocol authentication error
[2018-05-07T15:19:30.936] error: Munge decode failed: Invalid credential
[2018-05-07T15:19:30.936] error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid credential
[2018-05-07T15:19:30.936] error: slurm_unpack_received_msg: Protocol authentication error
[2018-05-07T15:19:30.946] error: slurm_receive_msg [10.112.0.14:33062]: Unspecified error
[2018-05-07T15:19:30.946] error: slurm_receive_msg [10.112.0.6:37668]: Unspecified error
[2018-05-07T15:19:30.946] error: slurm_receive_msg [10.112.0.5:53906]: Unspecified error
[2018-05-07T15:19:30.946] error: slurm_receive_msg [10.112.0.16:41710]: Unspecified error
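
“Invalid credential” almost always means the key that encoded the message is not the key that decoded it. A quick way to confirm the copy actually took, as a sketch under the same assumptions (root SSH access, radonc01-04):

  # all five lines should print the identical hash
  md5sum /etc/munge/munge.key
  for n in radonc01 radonc02 radonc03 radonc04; do
      ssh "$n" md5sum /etc/munge/munge.key
  done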


Let me know what else I can check/do.


Thanks


_____________________________________________________________________________________________________

Eric F.  Alemany
System Administrator for Research

Division of Radiation & Cancer  Biology
Department of Radiation Oncology

Stanford University School of Medicine
Stanford, California 94305

Tel: 1-650-498-7969 (No Texting)
Fax: 1-650-723-7382



On May 7, 2018, at 1:07 PM, Paul Edmon <pedmon at cfa.harvard.edu> wrote:


Any command can be used to copy it.  We deploy ours using puppet.
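
For a four-node cluster, a plain scp loop is enough when no configuration management is in place. A sketch, assuming root SSH access and the radonc01-04 hostnames from the thread:

  # copy the key, fix ownership and mode, then reload munged
  for n in radonc01 radonc02 radonc03 radonc04; do
      scp -p /etc/munge/munge.key "$n":/etc/munge/munge.key
      ssh "$n" 'chown munge:munge /etc/munge/munge.key &&
                chmod 0400 /etc/munge/munge.key &&
                systemctl restart munge'
  done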

-Paul Edmon-

On 05/07/2018 04:04 PM, Eric F. Alemany wrote:
Thanks Andy.

I think I omitted a big step, which is copying /etc/munge/munge.key from the master/headnode to /etc/munge/munge.key on all the nodes - am I right? I don't recall doing this, so that could be the problem.

Is there a specific command I need to run to copy the munge.key from the master/headnode to all the nodes?

Thank you for your help and sorry for such “beginner” questions.

Best,
Eric



On May 7, 2018, at 12:57 PM, Andy Riebs <andy.riebs at hpe.com> wrote:


The two most likely causes of munge complaints:

1. Different keys in /etc/munge/munge.key
2. Clocks out of sync on the nodes in question
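
Both are quick to rule out from the master. Key equality can be compared with md5sum as sketched earlier in the thread; for the clocks, something along these lines (again assuming SSH access to the radonc01-04 nodes) shows any skew at a glance - munge credentials carry a short time-to-live, so clocks more than a few minutes apart are enough to make decoding fail:

  # print the master's clock next to each node's (seconds since epoch)
  for n in radonc01 radonc02 radonc03 radonc04; do
      echo "master: $(date +%s)   $n: $(ssh "$n" date +%s)"
  done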

Andy

On 05/07/2018 03:50 PM, Eric F. Alemany wrote:
Greetings,

Reminder: I am new to SLURM.

When I execute “sinfo”, my nodes are down.

sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      4  down* radonc[01-04]

This is what I have done so far and nothing has helped. The nodes are in “idle” state for 2-3 minutes and then they are “down” again.

systemctl restart slurmd    on all nodes

systemctl restart slurmctld  on master

scontrol update node=radonc[01-04] state=UNDRAIN

scontrol update node=radonc[01-04] state=IDLE
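
Before forcing the state back, it is also worth reading the reason Slurm recorded when it marked the nodes down - it usually names the failure directly:

  sinfo -R                       # down/drained nodes with the recorded reason
  scontrol show node radonc01    # full node state, including the Reason= field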



I looked at the log file in /var/log/SlurmdLogFile.log and saw some “Munge decode failed: Invalid credential” errors:

[2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid credential
[2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: Protocol authentication error
[2018-05-07T12:37:20.028] error: Munge decode failed: Invalid credential
[2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid credential
[2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: Protocol authentication error
[2018-05-07T12:37:20.038] error: slurm_receive_msg [10.112.0.14:42140]: Unspecified error
[2018-05-07T12:37:20.038] error: slurm_receive_msg [10.112.0.5:34752]: Unspecified error
[2018-05-07T12:37:20.038] error: slurm_receive_msg [10.112.0.6:46746]: Unspecified error
[2018-05-07T12:37:20.039] error: slurm_receive_msg [10.112.0.16:50788]: Unspecified error


I ran the following command on all nodes (including master/headnode) and got “Success”:

 munge -n | unmunge | grep STATUS
STATUS:           Success (0)
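
A local munge -n | unmunge only proves that each node's munged can decode its own credentials; it reports Success even when every host holds a different key. The decisive test is decoding a credential on a different host from the one that created it, e.g. from the master (radonc01 taken from the node list above):

  # fails with "Invalid credential" if the two keys differ
  munge -n | ssh radonc01 unmunge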


How can I fix this problem?


Thank you in advance for all your help.

Eric







--
Andy Riebs
andy.riebs at hpe.com<mailto:andy.riebs at hpe.com>
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
    May the source be with you!



