<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
That was my first thought too, but no. /etc/hosts (which is not used)
and slurm.conf are identical on all nodes, working and
non-working alike.<br>
<br>
<u>From login machine:</u><br>
<tt>[alex@li1 ~]$ srun --nodelist=cn7 ping -c 1 cn7</tt><tt><br>
</tt><tt>srun: job 1118071 queued and waiting for resources</tt><tt><br>
</tt><tt>srun: job 1118071 has been allocated resources</tt><tt><br>
</tt><tt>srun: error: fwd_tree_thread: can't find address for host
cn7, check slurm.conf</tt><tt><br>
</tt><tt>srun: error: Task launch for 1118071.0 failed on node cn7:
Can't find an address, check slurm.conf</tt><tt><br>
</tt><tt>srun: error: Application launch failed: Can't find an
address, check slurm.conf</tt><tt><br>
</tt><tt>srun: Job step aborted: Waiting up to 32 seconds for job
step to finish.</tt><tt><br>
</tt><tt>srun: error: Timed out waiting for job step to complete</tt><br>
<br>
<u>From slurmctld machine:</u><br>
<tt>[root@cmgr1 ~]# srun --nodelist=cn7 ping -c 1 cn7</tt><tt><br>
</tt><tt>srun: job 1118076 queued and waiting for resources</tt><tt><br>
</tt><tt>srun: job 1118076 has been allocated resources</tt><tt><br>
</tt><tt>PING cn7.ydesign.se (10.28.3.137) 56(84) bytes of data.</tt><tt><br>
</tt><tt>64 bytes from cn7.ydesign.se (10.28.3.137): icmp_seq=1
ttl=64 time=0.012 ms</tt><tt><br>
</tt><tt><br>
</tt><tt>--- cn7.ydesign.se ping statistics ---</tt><tt><br>
</tt><tt>1 packets transmitted, 1 received, 0% packet loss, time 0ms</tt><tt><br>
</tt><tt>rtt min/avg/max/mdev = 0.012/0.012/0.012/0.000 ms</tt><br>
<br>
<br>
I guess that a state file somewhere got corrupted. The next step
will be to try resetting the relevant state file and test again
or, if that fails, clean it with fire! ;-)<br>
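<br>
Before resetting anything, one cheap check might be to compare the
address slurmctld reports for cn7 with what the local resolver
returns, and to run that on both li1 and cmgr1. This is a rough
sketch, not a verified fix: the helper names <tt>field_value</tt> and
<tt>compare_node_addr</tt> are made up for illustration, and it assumes
that <tt>scontrol show node</tt> prints whitespace-separated
<tt>KEY=value</tt> fields including <tt>NodeAddr</tt>, and that
<tt>getent</tt> is available on the host.<br>
<pre>
```shell
#!/bin/sh
# Sketch only: field_value and compare_node_addr are hypothetical helper
# names, not part of Slurm.

# Read "scontrol show node" style text on stdin and print the value of
# the first KEY=value field whose key equals the first argument.
field_value() {
    tr -s ' \t' '\n\n' | sed -n "s/^$1=//p" | head -n 1
}

# Print the address the controller reports for a node, followed by what
# the local resolver returns for the same name, so they can be compared.
compare_node_addr() {
    node="$1"
    ctld_addr=$(scontrol show node "$node" | field_value NodeAddr)
    local_addr=$(getent hosts "$node" | awk '{ print $1; exit }')
    echo "slurmctld NodeAddr: ${ctld_addr:-none}"
    echo "local resolver:     ${local_addr:-none}"
}

# Example, to be run on both li1 and cmgr1:
#   compare_node_addr cn7
```
</pre>
If the output differs between the login machine and the slurmctld
machine, that would at least narrow down where the address lookup
goes wrong.<br>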
<br>
Regards,<br>
Alexander Åhman<br>
<br>
<br>
<br>
<div class="moz-cite-prefix">On 2019-05-29 at 19:23, Alex
Chekholko wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CANcy_PaX7B3y0gk7K7JuAwqWLcfEbFLBMsZaDBgo_KB0TnyVVQ@mail.gmail.com">
<div dir="ltr">I think this error usually means that your node
cn7 has either the wrong /etc/hosts or the wrong
/etc/slurm/slurm.conf.
<div><br>
</div>
<div>E.g. try 'srun --nodelist=cn7 ping -c 1 cn7'</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, May 29, 2019 at 6:00
AM Alexander Åhman &lt;<a href="mailto:alexander@ydesign.se"
moz-do-not-send="true">alexander@ydesign.se</a>&gt; wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>
I have a very strange problem. The cluster had been working just
fine until one node died, and now I can't submit jobs to 2 of the
nodes using srun from the login machine. Using sbatch works just
fine, as does srun from the same host as slurmctld.<br>
All the other nodes work just fine, as they always have; only 2
nodes are experiencing this problem. Very strange...<br>
<br>
I have checked network connectivity and DNS, and both are OK. I
can ping and ssh to all nodes just fine. All nodes are identical
and run Slurm 18.08.<br>
I also tried rebooting the 2 nodes and slurmctld, but the problem
remains.<br>
<br>
[alex@li1 ~]$ srun -w cn7 hostname<br>
srun: error: fwd_tree_thread: can't find address for host cn7,
check slurm.conf<br>
srun: error: Task launch for 1088816.0 failed on node cn7:
Can't find an address, check slurm.conf<br>
srun: error: Application launch failed: Can't find an address,
check slurm.conf<br>
srun: Job step aborted: Waiting up to 32 seconds for job step
to finish.<br>
srun: error: Timed out waiting for job step to complete<br>
<br>
[alex@li1 ~]$ srun -w cn6 hostname<br>
cn6.ydesign.se<br>
<br>
What is this error "can't find address for host" about? I have
searched the web but can't find any good information about what
the problem is or how to resolve it.<br>
<br>
Any kind soul out there who knows what to do next?<br>
<br>
Regards,<br>
Alexander Åhman<br>
<br>
<br>
</blockquote>
</div>
</blockquote>
<br>
</body>
</html>