[slurm-users] Failed to launch jobs with mpirun after upgrading to Slurm 19.05
Andrés Marín Díaz
amarin at cesvima.upm.es
Wed Jun 5 17:04:11 UTC 2019
Hello, since we updated to the new Slurm version (19.05), every job step
launched with mpirun ends with the following error message:
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to lack of common network interfaces and / or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
This only happens when the job step spans more than one node. If all
tasks run on the same node, it works without problems.
We have tested with different versions of OpenMPI (2.1.2, 3.1.1, 3.1.3),
all compiled with the flags --with-slurm and --with-pmi. In all cases,
if the job is launched on nodes running Slurm 18.05, it works with both
srun and mpirun. If it is launched on nodes running Slurm 19.05, it
works with srun but fails with mpirun.
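For anyone who wants to reproduce the comparison, here is a minimal sketch of the test matrix described above (srun versus mpirun, on one node and on two nodes). The helper name build_commands, the binary path ./mpi_hello and the task counts are hypothetical placeholders, not our actual setup. It only prints the command lines and does not submit anything.

```python
from itertools import product


def build_commands(binary="./mpi_hello", node_counts=(1, 2), tasks_per_node=2):
    """Return (label, argv) pairs covering srun and mpirun on each node count.

    All names and defaults are illustrative assumptions, not taken from
    the original cluster configuration.
    """
    commands = []
    for launcher, nodes in product(("srun", "mpirun"), node_counts):
        ntasks = nodes * tasks_per_node
        if launcher == "srun":
            # srun requests the resources and launches the tasks itself.
            argv = ["srun", f"--nodes={nodes}", f"--ntasks={ntasks}", binary]
        else:
            # mpirun is started inside an allocation obtained with salloc.
            argv = [
                "salloc", f"--nodes={nodes}", f"--ntasks={ntasks}",
                "mpirun", "-np", str(ntasks), binary,
            ]
        commands.append((f"{launcher} on {nodes} node(s)", argv))
    return commands


if __name__ == "__main__":
    for label, argv in build_commands():
        print(f"{label}: {' '.join(argv)}")
```

Running each printed command by hand against the 18.05 and the 19.05 nodes should show which of the four combinations fails.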
Could this be a bug in the new version?
Thank you.
--
Andrés Marín Díaz
Servicio de Infraestructura e Innovación
Universidad Politécnica de Madrid
Centro de Supercomputación y Visualización de Madrid (CeSViMa)
Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES)
amarin at cesvima.upm.es | tel 910679676
www.cesvima.upm.es | www.twitter.com/cesvima | www.fb.com/cesvima