[slurm-users] Failed to launch jobs with mpirun after upgrading to Slurm 19.05
Sean Crosby
scrosby at unimelb.edu.au
Thu Jun 6 10:46:02 UTC 2019
Hi Andrés,
Did you recompile OpenMPI after updating to SLURM 19.05?
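If OpenMPI was built with Slurm's PMI/PMIx support it typically links against the Slurm libraries, so an mpirun built for 18.08 can fail to launch its ORTE daemons after the upgrade. As a rough sketch (the prefix, version and PMIx path below are only placeholders for whatever your site uses):

    # show how the current OpenMPI was configured
    ompi_info | grep -i 'configure command'

    # rebuild against the upgraded Slurm
    ./configure --prefix=/opt/openmpi/3.1.4 --with-slurm --with-pmix=/usr
    make -j 8 && make install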
Sean
--
Sean Crosby
Senior DevOps/HPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of Physics
University of Melbourne
On Thu, 6 Jun 2019 at 20:11, Andrés Marín Díaz <amarin at cesvima.upm.es> wrote:
Thank you very much for the help. Here is some updated information.
- If we use Intel MPI (IMPI) mpirun, it works correctly.
- If we use mpirun without going through the scheduler, it works correctly.
- If we use srun with software compiled with OpenMPI, it works correctly.
- If we use SLURM 18.08.6, it works correctly.
- If we use SLURM 19.05.0 and call mpirun inside the sbatch script (a minimal script is sketched after the error message below), we get the error:
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to lack of common network interfaces and / or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
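In case it helps, the failing case boils down to a minimal sbatch script like the following (module and binary names are simplified placeholders for what we actually run; the resources match the job in the logs below):

    #!/bin/bash
    #SBATCH --job-name=mpi_test
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1
    #SBATCH --mem=1024M

    module load openmpi      # OpenMPI build from before the upgrade
    mpirun ./hello_mpi       # fails with the ORTE error above
    # srun ./hello_mpi       # the same binary launches fine with srun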
Trying to trace the problem:
- mpirun is a binary and cannot be traced with bash -x.
- I have run "strace mpirun hostname" to see if it helps, but I am not able to see where the problem lies.
Here is the output from the strace:
https://cloud.cesvima.upm.es/index.php/s/hWQMkwU5zW7J8RW
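A quick way to see how this OpenMPI was built against Slurm/PMI, in case that turns out to be relevant, is something along these lines (just a sketch):

    ompi_info | grep -Ei 'slurm|pmi'
    srun --mpi=list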
And here are the slurmd logs with verbose level 5:
Main node (slurmd log):
2019-06-06T09:51:54.255743+00:00 r1n1 slurmd[108517]: _run_prolog: run job script took usec=7
2019-06-06T09:51:54.256118+00:00 r1n1 slurmd[108517]: _run_prolog: prolog with lock for job 11057 ran for 0 seconds
2019-06-06T09:51:54.258887+00:00 r1n1 slurmd[108517]: task_p_slurmd_batch_request: 11057
2019-06-06T09:51:54.259317+00:00 r1n1 slurmd[108517]: task/affinity: job 11057 CPU input mask for node: 0x0000000001
2019-06-06T09:51:54.259680+00:00 r1n1 slurmd[108517]: task/affinity: job 11057 CPU final HW mask for node: 0x0000000001
2019-06-06T09:51:54.279614+00:00 r1n1 slurmstepd[108548]: task affinity plugin loaded with CPU mask
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
2019-06-06T09:51:54.280312+00:00 r1n1 slurmstepd[108548]: Munge credential signature plugin loaded
2019-06-06T09:51:54.302921+00:00 r1n1 slurmstepd[108548]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:54.303592+00:00 r1n1 slurmstepd[108548]: task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:54.330322+00:00 r1n1 slurmd[108517]: Launching batch job 11057 for UID 2000
2019-06-06T09:51:54.353196+00:00 r1n1 slurmstepd[108556]: task affinity plugin loaded with CPU mask
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
2019-06-06T09:51:54.353899+00:00 r1n1 slurmstepd[108556]: Munge credential signature plugin loaded
2019-06-06T09:51:54.366478+00:00 r1n1 slurmstepd[108556]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:54.366755+00:00 r1n1 slurmstepd[108556]: task/cgroup: /slurm/uid_2000/job_11057/step_batch: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:54.393325+00:00 r1n1 slurmstepd[108556]: debug level = 2
2019-06-06T09:51:54.393754+00:00 r1n1 slurmstepd[108556]: starting 1 tasks
2019-06-06T09:51:54.401243+00:00 r1n1 slurmstepd[108556]: task 0 (108561) started 2019-06-06T09:51:54
2019-06-06T09:51:54.416396+00:00 r1n1 slurmstepd[108561]: task_p_pre_launch: Using sched_affinity for tasks
2019-06-06T09:51:56.514908+00:00 r1n1 slurmstepd[108556]: task 0 (108561) exited with exit code 1.
2019-06-06T09:51:56.554430+00:00 r1n1 slurmstepd[108556]: job 11057 completed with slurm_rc = 0, job_rc = 256
2019-06-06T09:51:56.554847+00:00 r1n1 slurmstepd[108556]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
2019-06-06T09:51:56.559856+00:00 r1n1 slurmstepd[108556]: done with job
2019-06-06T09:51:56.596762+00:00 r1n1 slurmstepd[108548]: Sent signal 18 to 11057.4294967295
2019-06-06T09:51:56.598072+00:00 r1n1 slurmstepd[108548]: Sent signal 15 to 11057.4294967295
2019-06-06T09:51:56.599141+00:00 r1n1 slurmstepd[108548]: _oom_event_monitor: oom-kill event count: 1
2019-06-06T09:51:56.641170+00:00 r1n1 slurmstepd[108548]: done with job
Secondary node (slurmd log):
2019-06-06T09:51:54.256047+00:00 r1n2 slurmd[84916]: _run_prolog: run job script took usec=7
2019-06-06T09:51:54.256432+00:00 r1n2 slurmd[84916]: _run_prolog: prolog with lock for job 11057 ran for 0 seconds
2019-06-06T09:51:54.279763+00:00 r1n2 slurmstepd[84954]: task affinity plugin loaded with CPU mask
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
2019-06-06T09:51:54.280448+00:00 r1n2 slurmstepd[84954]: Munge credential signature plugin loaded
2019-06-06T09:51:54.313852+00:00 r1n2 slurmstepd[84954]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:54.314502+00:00 r1n2 slurmstepd[84954]: task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:56.564758+00:00 r1n2 slurmstepd[84954]: Sent signal 18 to 11057.4294967295
2019-06-06T09:51:56.608028+00:00 r1n2 slurmstepd[84954]: Sent signal 15 to 11057.4294967295
2019-06-06T09:51:56.609259+00:00 r1n2 slurmstepd[84954]: _oom_event_monitor: oom-kill event count: 1
2019-06-06T09:51:56.638334+00:00 r1n2 slurmstepd[84954]: done with job
Thank you very much again.
--
Andrés Marín Díaz
Servicio de Infraestructura e Innovación
Universidad Politécnica de Madrid
Centro de Supercomputación y Visualización de Madrid (CeSViMa)
Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES)
amarin at cesvima.upm.es | tel 910679676
www.cesvima.upm.es | www.twitter.com/cesvima | www.fb.com/cesvima