[slurm-users] Can't find an address

Andy Riebs andy.riebs at hpe.com
Thu Oct 25 04:57:37 MDT 2018

Make sure that the "hostname" command returns the same name that Slurm 
expects on your compute nodes.
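
One way to make that comparison concrete is the sketch below. It assumes (the thread does not confirm this) that the node is defined in slurm.conf under a short name such as myName1, and that the short form of the local host name is what has to match. The helper names and the exact, case-sensitive comparison are illustrative choices, not anything taken from Slurm itself.

```python
import socket


def short_hostname(name):
    """Return the part of a host name before the first dot."""
    return name.split(".", 1)[0]


def hostname_matches(expected_node, actual=None):
    """Compare the expected node name with the local host name.

    `expected_node` is whatever name the node has in slurm.conf
    (an assumption supplied by the caller). If `actual` is omitted,
    the name reported by the operating system is used.
    """
    if actual is None:
        actual = socket.gethostname()
    return short_hostname(actual) == expected_node


if __name__ == "__main__":
    # "myName1" is the placeholder node name from the log message.
    print(socket.gethostname(), hostname_matches("myName1"))
```

Run on the compute node itself, this prints the local host name and whether its short form equals the expected node name.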

*From:* Zohar Roe Mlm <RZohar8 at iai.co.il>
*Sent:* Thursday, October 25, 2018 3:02AM
*To:* 'Slurm User Community List' <slurm-users at lists.schedmd.com>
*Subject:* Re: [slurm-users] Can't find an address

Hi Lachlan,

Thanks for the reply. I am trying to find more ideas for this problem. 
Maybe it is some system or strange communication problem.

As for your suggestion:

>  Check that it's in /etc/hosts --> It is, and the node answers ping by both IP 
and host name every time I check.

>  Check the slurmd logs --> In the node log there is no error. In the 
server log there is the error I wrote ("agent/is_node_resp: node:myName1 
RPC:REQUEST_PING : Can't find an address, check slurm.conf")

>  Make sure there is enough disk space --> More than enough

>  Make sure that its datetime is synchronized with the others  --> Same 
time and date on all nodes and the Slurm server.

The problem is that I don't see any other error, and the node is up and 
running without any error.

The communication looks good, with ping succeeding, but it still looks like 
the server can't find the node (and this happens every two minutes, always).

Thanks for your ideas,


*From:*slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] *On 
Behalf Of *Lachlan Musicman
*Sent:* Thursday, October 25, 2018 1:59 AM
*To:* Slurm User Community List
*Subject:* Re: [slurm-users] Can't find an address

On Wed, 24 Oct 2018 at 22:56, Zohar Roe MLM <RZohar8 at iai.co.il> wrote:


    I have a node that for some reason changes state to "Down" every few
    minutes.

    When I change it with scontrol to "resume" it's OK until it goes Down again.

    In the Slurm server log I can see the error:

    "agent/is_node_resp: node:myName1 RPC:REQUEST_PING : Can't find an
    address, check slurm.conf"

    Now, the error message seems fairly straightforward, but I can't
    find the problem.

    * The node is up and answers ping from the Slurm server.

    * The Slurm daemon on the node is up and running.

    * There isn't any error on the node itself.

    * There are more nodes, configured the same (except for the IP
    address), that are OK.

    * Running "scontrol update state=resume nodename=myNode" fixes the
    problem for a short time.

    * Restarting the Slurm daemon on the node also fixes this for a short time.

    Any idea what more I can check to resolve this?

Here's a quick off-the-top-of-my-head checklist:

Check that it's in /etc/hosts

Check the slurmd logs

Make sure there is enough disk space

Make sure that its datetime is synchronized with the others
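
The first item on the list can also be checked through the system resolver, which covers /etc/hosts and DNS together. This is a minimal sketch, assuming the error is related to name resolution on the machine running the Slurm server (the thread does not confirm that). The function name is hypothetical, and "localhost" below is only a stand-in for the real node names.

```python
import socket


def resolve_addresses(name):
    """Return the sorted IP addresses `name` resolves to, or [] if none."""
    try:
        infos = socket.getaddrinfo(name, None, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # The resolver could not find an address for this name.
        return []
    # Each entry's sockaddr tuple starts with the address string.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Replace "localhost" with the node names used in slurm.conf.
    for node in ["localhost"]:
        addrs = resolve_addresses(node)
        print(node, addrs if addrs else "does not resolve")
```

Running it on the Slurm server rather than on the node matters here, since the error is logged on the server side.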



