[slurm-users] Nodes are down after 2-3 minutes.

Paul Edmon pedmon at cfa.harvard.edu
Mon May 7 14:07:16 MDT 2018


Any copy command will do (scp, rsync, etc.); we deploy ours with Puppet.
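
For example, a minimal by-hand sketch, assuming passwordless root ssh from the
head node and the radonc01-04 hostnames mentioned later in this thread:

    for node in radonc0{1..4}; do
        # push the head node's key to the compute node
        scp -p /etc/munge/munge.key root@${node}:/etc/munge/munge.key
        # munged refuses to start if the key is not owned by munge or is readable by others
        ssh root@${node} 'chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key'
        # restart munge and slurmd so the new key takes effect
        ssh root@${node} 'systemctl restart munge slurmd'
    done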

-Paul Edmon-


On 05/07/2018 04:04 PM, Eric F. Alemany wrote:
> Thanks Andy.
>
> I think I omitted a big step, which is copying /etc/munge/munge.key 
> from the master/headnode to /etc/munge/munge.key on all the nodes - am 
> I right?  I don't recall doing this, so that could be the problem.
>
> Is there a specific command I need to run to copy the munge.key from 
> the master/headnode to all the nodes?
>
> Thank you for your help and sorry for such “beginner” questions.
>
> Best,
> Eric
> _____________________________________________________________________________________________________
>
> Eric F. Alemany
> System Administrator for Research
>
> Division of Radiation & Cancer Biology
> Department of Radiation Oncology
>
> Stanford University School of Medicine
> Stanford, California 94305
>
> Tel: 1-650-498-7969  No Texting
> Fax: 1-650-723-7382
>
>
>
>> On May 7, 2018, at 12:57 PM, Andy Riebs <andy.riebs at hpe.com> wrote:
>>
>> The two most likely causes of munge complaints:
>>
>> 1. Different keys in /etc/munge/munge.key
>> 2. Clocks out of sync on the nodes in question
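>>
>> A quick way to check both from the head node (a rough sketch, assuming root
>> ssh to the nodes and the radonc01-04 hostnames from this thread):
>>
>>     for node in radonc0{1..4}; do
>>         # key checksum and clock should match the head node's
>>         ssh root@${node} 'md5sum /etc/munge/munge.key; date'
>>     done
>>     md5sum /etc/munge/munge.key; date    # same commands locally, for comparison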
>>
>> Andy
>>
>>
>> On 05/07/2018 03:50 PM, Eric F. Alemany wrote:
>>> Greetings,
>>>
>>> Reminder: I am new to SLURM.
>>>
>>> When I execute “sinfo”, my nodes are down.
>>>
>>> sinfo
>>> PARTITION AVAIL TIMELIMIT  NODES  STATE NODELIST
>>> debug*       up infinite      4  down* radonc[01-04]
>>>
>>> This is what I have done so far, and nothing has helped. The nodes 
>>> are in the “idle” state for 2-3 minutes and then they are “down” again.
>>>
>>> systemctl restart slurmd    on all nodes
>>>
>>> systemctl restart slurmctld  on master
>>>
>>> scontrol update node=radonc[01-04] state=UNDRAIN
>>>
>>> scontrol update node=radonc[01-04] state=IDLE
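>>>
>>> (For reference, a sketch of how a downed node is usually brought back once the
>>> root cause is fixed: UNDRAIN only clears a drain flag and leaves a DOWN base
>>> state alone, so RESUME is normally used instead, and sinfo -R shows why
>>> slurmctld marked the node down.)
>>>
>>>     sinfo -R                                          # reason each node was set down
>>>     scontrol update nodename=radonc[01-04] state=resume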
>>>
>>>
>>>
>>> I looked at the log file in /var/log/SlurmdLogFile.log  and saw some 
>>> “munge decode failed: Invalid credential”
>>>
>>> [2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: 
>>> MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid 
>>> credential
>>> [2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: Protocol 
>>> authentication error
>>> [2018-05-07T12:37:20.028] error: Munge decode failed: Invalid credential
>>> [2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: 
>>> MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid 
>>> credential
>>> [2018-05-07T12:37:20.028] error: slurm_unpack_received_msg: Protocol 
>>> authentication error
>>> [2018-05-07T12:37:20.038] error: slurm_receive_msg 
>>> [10.112.0.14:42140]: Unspecified error
>>> [2018-05-07T12:37:20.038] error: slurm_receive_msg 
>>> [10.112.0.5:34752]: Unspecified error
>>> [2018-05-07T12:37:20.038] error: slurm_receive_msg 
>>> [10.112.0.6:46746]: Unspecified error
>>> [2018-05-07T12:37:20.039] error: slurm_receive_msg 
>>> [10.112.0.16:50788]: Unspecified error
>>>
>>>
>>> I ran the following command on all nodes (including master/headnode) 
>>> and got “Success”
>>>
>>>  munge -n | unmunge | grep STATUS
>>> STATUS:           Success (0)
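>>>
>>> (That only exercises each node's own local key; the usual cross-node check is
>>> to encode on one host and decode on another, for example from the
>>> master/headnode against one of the nodes:)
>>>
>>>     munge -n | ssh radonc01 unmunge    # fails with "Invalid credential" if the keys differ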
>>>
>>>
>>> How can I fix this problem?
>>>
>>>
>>> Thank you in advance for all your help.
>>>
>>> Eric
>>>
>>>
>>> _____________________________________________________________________________________________________
>>>
>>> Eric F. Alemany
>>> System Administrator for Research
>>>
>>> Division of Radiation & Cancer Biology
>>> Department of Radiation Oncology
>>>
>>> Stanford University School of Medicine
>>> Stanford, California 94305
>>>
>>> Tel: 1-650-498-7969  No Texting
>>> Fax: 1-650-723-7382
>>>
>>>
>>>
>>
>> -- 
>> Andy Riebs
>> andy.riebs at hpe.com
>> Hewlett-Packard Enterprise
>> High Performance Computing Software Engineering
>> +1 404 648 9024
>> My opinions are not necessarily those of HPE
>>      May the source be with you!
>
