[slurm-users] Submit job using srun fails but sbatch works

Alex Chekholko alex at calicolabs.com
Wed May 29 17:23:51 UTC 2019


I think this error usually means that node cn7 has either the wrong
/etc/hosts or the wrong /etc/slurm/slurm.conf.

E.g. try 'srun --nodelist=cn7 ping -c 1 cn7'
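
If it helps, here is a rough sketch of how one might check both suspects by hand. It assumes a POSIX shell with cmp, getent and awk available; the helper names same_file and resolve_host are made up for illustration and are not part of Slurm, and the /tmp path in the usage comments is only a placeholder.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): succeeds only if the two
# files given as arguments have byte-identical contents.
same_file() {
    cmp -s "$1" "$2"
}

# Hypothetical helper: print the first address the system resolver
# (/etc/hosts, then DNS, depending on nsswitch.conf) returns for a name.
# Prints nothing if the name does not resolve.
resolve_host() {
    getent hosts "$1" | awk '{print $1; exit}'
}

# Possible usage, run on the login node and again on cn7, then compare:
#   resolve_host cn7
#   scontrol show node cn7 | grep -i NodeAddr
#   scp cn7:/etc/slurm/slurm.conf /tmp/slurm.conf.cn7   # placeholder path
#   same_file /etc/slurm/slurm.conf /tmp/slurm.conf.cn7 && echo identical
```

If resolve_host gives a different answer (or none) on one machine, or the slurm.conf copies differ, that would be the first thing I'd fix.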

On Wed, May 29, 2019 at 6:00 AM Alexander Åhman <alexander at ydesign.se>
wrote:

> Hi,
> I have a very strange problem. The cluster was working just fine
> until one node died, and now I can't submit jobs to 2 of the nodes using
> srun from the login machine. sbatch works just fine, as does srun
> when run from the same host as slurmctld.
> All the other nodes work just fine, as they always have; only 2 nodes are
> experiencing this problem. Very strange...
>
> I have checked network connectivity and DNS, and both are OK. I can ping
> and ssh to all nodes just fine. All nodes are identical and run Slurm 18.08.
> I also tried rebooting the 2 nodes and slurmctld, but the problem persists.
>
> [alex@li1 ~]$ srun -w cn7 hostname
> srun: error: fwd_tree_thread: can't find address for host cn7, check slurm.conf
> srun: error: Task launch for 1088816.0 failed on node cn7: Can't find an address, check slurm.conf
> srun: error: Application launch failed: Can't find an address, check slurm.conf
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
> [alex@li1 ~]$ srun -w cn6 hostname
> cn6.ydesign.se
>
> What is this error "can't find address for host" about? I have searched
> the web but can't find any good information about what the problem is or
> how to resolve it.
>
> Any kind soul out there who knows what to do next?
>
> Regards,
> Alexander Åhman
>
>
>