[slurm-users] Submit job using srun fails but sbatch works
Alexander Åhman
alexander at ydesign.se
Mon Jun 3 14:53:39 UTC 2019
That was my first thought too, but... no. Both /etc/hosts (not used) and
slurm.conf are identical on all nodes, working and non-working alike.
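One way to double-check a claim like this is to compare checksums of the files rather than eyeballing them. The following is only a sketch: `same_checksum` is a hypothetical helper, and it assumes a copy of each node's slurm.conf has already been collected locally under made-up file names.

```shell
# Hypothetical helper: print "identical" if all given files share the same
# SHA-256 checksum, otherwise print "differs".
same_checksum() {
    # Keep only the checksum column and count the distinct values.
    count=$(sha256sum "$@" | awk '{print $1}' | sort -u | wc -l)
    if [ "$count" -eq 1 ]; then
        echo identical
    else
        echo differs
    fi
}

# Example (file names are assumptions; fetch the copies first, e.g. with scp):
#   same_checksum slurm.conf.li1 slurm.conf.cn6 slurm.conf.cn7
```

A byte-for-byte match only shows that the files agree; it does not rule out differences in what the running daemons actually loaded.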
_From login machine:_
[alex@li1 ~]$ srun --nodelist=cn7 ping -c 1 cn7
srun: job 1118071 queued and waiting for resources
srun: job 1118071 has been allocated resources
srun: error: fwd_tree_thread: can't find address for host cn7, check slurm.conf
srun: error: Task launch for 1118071.0 failed on node cn7: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
_From slurmctld machine:_
[root@cmgr1 ~]# srun --nodelist=cn7 ping -c 1 cn7
srun: job 1118076 queued and waiting for resources
srun: job 1118076 has been allocated resources
PING cn7.ydesign.se (10.28.3.137) 56(84) bytes of data.
64 bytes from cn7.ydesign.se (10.28.3.137): icmp_seq=1 ttl=64 time=0.012 ms
--- cn7.ydesign.se ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.012/0.012/0.012/0.000 ms
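Since the same command behaves differently on li1 and cmgr1, it may help to compare what each machine reports as the address of cn7. This is a minimal sketch, assuming `scontrol` and `getent` are available on both hosts; `node_addr` is a hypothetical helper that only pulls the NodeAddr field out of `scontrol show node` output.

```shell
# Hypothetical helper: read `scontrol show node` output on stdin and print
# the value of the NodeAddr field.
node_addr() {
    sed -n 's/.*NodeAddr=\([^ ]*\).*/\1/p'
}

# Run on both the login machine and the slurmctld machine, then compare
# (these two commands need a working Slurm client and name resolution):
#   scontrol show node cn7 | node_addr
#   getent hosts cn7
```

If the two machines disagree, that would point at the side to investigate, although it would not by itself explain why sbatch still works.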
I guess some state file somewhere got corrupted. I think the next step
will be to reset the relevant state file and try again or, if that
fails, clean it with fire! ;-)
Regards,
Alexander Åhman
On 2019-05-29 at 19:23, Alex Chekholko wrote:
> I think this error usually means that on your node cn7 it has either
> the wrong /etc/hosts or the wrong /etc/slurm/slurm.conf
>
> E.g. try 'srun --nodelist=cn7 ping -c 1 cn7'
>
> On Wed, May 29, 2019 at 6:00 AM Alexander Åhman <alexander at ydesign.se> wrote:
>
> Hi,
> I have a very strange problem. The cluster had been working just fine
> until one node died, and now I can't submit jobs to 2 of the nodes using
> srun from the login machine. Using sbatch works just fine, and so does
> srun from the same host as slurmctld.
> All the other nodes work just fine, as they always have; only 2 nodes are
> experiencing this problem. Very strange...
>
> I have checked network connectivity and DNS, and both are OK. I can ping
> and ssh to all nodes just fine. All nodes are identical and run Slurm 18.08.
> I also tried rebooting the 2 nodes and slurmctld, but the problem remains.
>
> [alex@li1 ~]$ srun -w cn7 hostname
> srun: error: fwd_tree_thread: can't find address for host cn7, check slurm.conf
> srun: error: Task launch for 1088816.0 failed on node cn7: Can't find an address, check slurm.conf
> srun: error: Application launch failed: Can't find an address, check slurm.conf
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
> [alex@li1 ~]$ srun -w cn6 hostname
> cn6.ydesign.se
>
> What does the error "can't find address for host" mean? I have searched
> the web but can't find any good information about what the problem is or
> how to resolve it.
>
> Is there any kind soul out there who knows what to do next?
>
> Regards,
> Alexander Åhman
>
>