[slurm-users] Failed to launch jobs with mpirun after upgrading to Slurm 19.05

Andrés Marín Díaz amarin at cesvima.upm.es
Wed Jun 5 17:04:11 UTC 2019


Hello. Since we upgraded to the new Slurm version (19.05), every job step 
launched with mpirun ends with the following error message:

     An ORTE daemon has unexpectedly failed after launch and before
     communicating back to mpirun. This could be caused by a number
     of factors, including an inability to create a connection back
     to mpirun due to lack of common network interfaces and / or no
     route found between them. Please check network connectivity
     (including firewalls and network routing requirements).

This only happens when the job step spans more than one node. If all 
tasks run on the same node, it works without problems.

We have tested with different versions of OpenMPI (2.1.2, 3.1.1, 3.1.3), 
all of them compiled with the flags --with-slurm and --with-pmi. In all 
cases, if the job is launched on nodes running Slurm 18.05, it works with 
both srun and mpirun. If it is launched on nodes running Slurm 19.05, it 
works with srun but fails with mpirun.
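
For reference, the comparison can be reproduced with a two-node batch job 
that runs the same command once through srun and once through mpirun. The 
sketch below only generates such a script. The helper name 
build_repro_script and the use of hostname as the test command are 
illustrative assumptions, not our exact job scripts.

```python
def build_repro_script(nodes=2, tasks_per_node=1, command="hostname"):
    """Return the text of a batch script that runs `command` via srun, then mpirun.

    Hypothetical helper for illustration; the defaults are assumptions.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        "",
        "# Launch through Slurm directly (works for us on 19.05).",
        f"srun {command}",
        "",
        "# Launch through Open MPI's launcher (fails for us across nodes on 19.05).",
        f"mpirun {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the script so it can be saved to a file and submitted with sbatch.
    print(build_repro_script(), end="")
```

Setting nodes=1 gives the single-node case, which works for us with both 
launchers.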

Could this be a bug in the new version?

Thank you.


-- 
  Andrés Marín Díaz
  
  Servicio de Infraestructura e Innovación
  Universidad Politécnica de Madrid
  
  Centro de Supercomputación y Visualización de Madrid (CeSViMa)
  Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES)
  amarin at cesvima.upm.es | tel 910679676
  
  www.cesvima.upm.es | www.twitter.com/cesvima | www.fb.com/cesvima
   



