[slurm-users] Submit job using srun fails but sbatch works

Alexander Åhman alexander at ydesign.se
Wed May 29 13:00:50 UTC 2019


Hi,
I have a very strange problem. The cluster had been working just fine
until one node died, and now I can't submit jobs to 2 of the nodes using
srun from the login machine. Using sbatch works just fine, and so does
srun when I run it on the same host as slurmctld.
All the other nodes work just fine, as they always have; only 2 nodes
are experiencing this problem. Very strange...

I have checked network connectivity and DNS, and both are OK: I can ping
and ssh to all nodes just fine. All nodes are identical and run Slurm
18.08. I also tried rebooting the 2 nodes and slurmctld, but the problem
is still the same.
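
A minimal sketch of that kind of name-resolution check, assuming getent
is available on the login machine. The helper name check_resolves is
made up for illustration; it only tests whether the local resolver knows
the name and says nothing about what Slurm itself has configured.

```shell
#!/bin/sh
# Hypothetical helper: report whether a host name resolves on this machine.
# getent goes through the system's name service switch (/etc/hosts, DNS, ...),
# as configured in /etc/nsswitch.conf.
check_resolves() {
    if getent hosts "$1" >/dev/null 2>&1; then
        echo "$1: resolves"
    else
        echo "$1: does NOT resolve"
        return 1
    fi
}
```

Run on the login machine for a working and a failing node, e.g.
`check_resolves cn6; check_resolves cn7`, it should print one line per
node.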

[alex@li1 ~]$ srun -w cn7 hostname
srun: error: fwd_tree_thread: can't find address for host cn7, check slurm.conf
srun: error: Task launch for 1088816.0 failed on node cn7: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

[alex@li1 ~]$ srun -w cn6 hostname
cn6.ydesign.se

What is this error "can't find address for host" about? I have searched
the web but can't find any good information about what the problem is or
what to do to resolve it.

Any kind soul out there who knows what to do next?

Regards,
Alexander Åhman



