[slurm-users] Submit job using srun fails but sbatch works

Mon Jun 3 14:53:39 UTC 2019

That was my first thought too, but... no. Both /etc/hosts (not used) and 
slurm.conf are identical on all nodes, working and non-working alike.
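
For anyone wanting to repeat that comparison, here is a minimal sketch of one way to do it, assuming the contents of slurm.conf have already been collected from each host by some means (for example copied over ssh). The function names and the sample contents are hypothetical; only the host names come from this thread.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def config_digest(text: str) -> str:
    """Return the SHA-256 hex digest of a config file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def outliers(configs: Dict[str, str]) -> List[str]:
    """Return hosts whose config differs from the most common version.

    `configs` maps a host name to the full text of its slurm.conf.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for host, text in configs.items():
        groups[config_digest(text)].append(host)
    if len(groups) <= 1:
        return []  # every host has byte-identical contents
    majority = max(groups.values(), key=len)
    return sorted(
        host
        for hosts in groups.values()
        if hosts is not majority
        for host in hosts
    )


if __name__ == "__main__":
    # Placeholder contents, not the real configuration from this cluster.
    sample = {
        "li1": "placeholder contents\n",
        "cmgr1": "placeholder contents\n",
        "cn6": "placeholder contents\n",
        "cn7": "placeholder contents\n",
    }
    print(outliers(sample))
```

An empty list means the files are byte-identical, which matches what I see here; it says nothing about whether the running daemons have actually re-read the file.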

_From login machine:_
[alex@li1 ~]$ srun --nodelist=cn7 ping -c 1 cn7
srun: job 1118071 queued and waiting for resources
srun: job 1118071 has been allocated resources
srun: error: fwd_tree_thread: can't find address for host cn7, check slurm.conf
srun: error: Task launch for 1118071.0 failed on node cn7: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

_From slurmctld machine:_
[root@cmgr1 ~]# srun --nodelist=cn7 ping -c 1 cn7
srun: job 1118076 queued and waiting for resources
srun: job 1118076 has been allocated resources
PING cn7.ydesign.se (10.28.3.137) 56(84) bytes of data.
64 bytes from cn7.ydesign.se (10.28.3.137): icmp_seq=1 ttl=64 time=0.012 ms

--- cn7.ydesign.se ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.012/0.012/0.012/0.000 ms
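
Since the two runs above differ only in the submitting host, one cheap extra check is whether each host's system resolver returns the same address for the node name. The sketch below is only that: the helper name `resolve` is hypothetical, and it does not reproduce Slurm's own address lookup, which the error message ties to slurm.conf.

```python
import socket
from typing import List


def resolve(host: str) -> List[str]:
    """Return the sorted, de-duplicated addresses the system resolver gives for host.

    An empty list means the lookup failed on this machine.
    """
    try:
        infos = socket.getaddrinfo(host, None)
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised when resolution fails.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve("cn7")` on both li1 and cmgr1 and comparing the results would confirm or rule out an ordinary name-resolution difference between the two hosts.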

My guess is that a state file somewhere got corrupted. I think the next 
step is to find and reset the relevant state file and try again or, if 
that fails, clean it with fire! ;-)

Regards,
Alexander Åhman

On 2019-05-29 at 19:23, Alex Chekholko wrote:
> I think this error usually means that your node cn7 has either the 
> wrong /etc/hosts or the wrong /etc/slurm/slurm.conf.
>
> E.g. try 'srun --nodelist=cn7 ping -c 1 cn7'
>
> On Wed, May 29, 2019 at 6:00 AM Alexander Åhman <alexander at ydesign.se> wrote:
>
>     Hi,
>     I have a very strange problem. The cluster had been working just fine
>     until one node died, and now I can't submit jobs to 2 of the nodes
>     using srun from the login machine. Using sbatch works just fine, as
>     does srun from the same host as slurmctld.
>     All the other nodes work just fine, as they always have; only 2
>     nodes are experiencing this problem. Very strange...
>
>     I have checked network connectivity and DNS, and both are OK. I can
>     ping and ssh to all nodes just fine. All nodes are identical and run
>     Slurm 18.08. I also tried rebooting the 2 nodes and slurmctld, but
>     the problem remains.
>
>     [alex@li1 ~]$ srun -w cn7 hostname
>     srun: error: fwd_tree_thread: can't find address for host cn7, check slurm.conf
>     srun: error: Task launch for 1088816.0 failed on node cn7: Can't find an address, check slurm.conf
>     srun: error: Application launch failed: Can't find an address, check slurm.conf
>     srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>     srun: error: Timed out waiting for job step to complete
>
>     [alex@li1 ~]$ srun -w cn6 hostname
>     cn6.ydesign.se
>
>     What is this error "can't find address for host" about? I have
>     searched the web but can't find any good information about what the
>     problem is or how to resolve it.
>
>     Any kind soul out there who knows what to do next?
>
>     Regards,
>     Alexander Åhman
>
>
