[slurm-users] Failed to launch jobs with mpirun after upgrading to Slurm 19.05

Andrés Marín Díaz amarin at cesvima.upm.es
Thu Jun 6 10:11:35 UTC 2019


Thank you very much for the help. Here is some updated information.

- If we use Intel MPI (IMPI) mpirun, it works correctly.
- If we use mpirun without going through the scheduler, it works correctly.
- If we use srun with software compiled against OpenMPI, it works correctly.
- If we use Slurm 18.08.6, it works correctly.
- If we use Slurm 19.05.0 and mpirun inside the sbatch script, we get the 
following error (a minimal sketch of such a script follows the error message):
--------------------------------------------------------------------------
     An ORTE daemon has unexpectedly failed after launch and before
     communicating back to mpirun. This could be caused by a number
     of factors, including an inability to create a connection back
     to mpirun due to lack of common network interfaces and / or no
     route found between them. Please check network connectivity
     (including firewalls and network routing requirements).
--------------------------------------------------------------------------
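
For reference, the failing case is an sbatch script roughly like the 
following (the module name and exact resource requests are only 
illustrative, not our real configuration):

     #!/bin/bash
     #SBATCH --nodes=2
     #SBATCH --ntasks-per-node=1
     #SBATCH --mem=1024

     # Illustrative environment setup
     module load openmpi

     # Fails under Slurm 19.05.0 with the ORTE error shown above;
     # launching the same work with srun works correctly.
     mpirun hostname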

Trying to trace the problem:
- mpirun is a binary, so it cannot be traced with bash -x.
- I ran "strace mpirun hostname" to see if it helps, but I am not able to 
see where the problem may be.

Here is the output from strace:
https://cloud.cesvima.upm.es/index.php/s/hWQMkwU5zW7J8RW
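
The trace above is from a plain "strace mpirun hostname"; if needed, it 
could be repeated so that the forked children (the orted daemons) are 
traced as well, for example:

     # -f follows forked child processes, -o writes the trace to a file
     strace -f -o mpirun_trace.out mpirun hostname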

And here the slurmd log with verbose level 5:
Main node (slurmd log):
     2019-06-06T09:51:54.255743+00:00 r1n1 slurmd[108517]: _run_prolog: 
run job script took usec=7
     2019-06-06T09:51:54.256118+00:00 r1n1 slurmd[108517]: _run_prolog: 
prolog with lock for job 11057 ran for 0 seconds
     2019-06-06T09:51:54.258887+00:00 r1n1 slurmd[108517]: 
task_p_slurmd_batch_request: 11057
     2019-06-06T09:51:54.259317+00:00 r1n1 slurmd[108517]: 
task/affinity: job 11057 CPU input mask for node: 0x0000000001
     2019-06-06T09:51:54.259680+00:00 r1n1 slurmd[108517]: 
task/affinity: job 11057 CPU final HW mask for node: 0x0000000001
     2019-06-06T09:51:54.279614+00:00 r1n1 slurmstepd[108548]: task 
affinity plugin loaded with CPU mask 
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
     2019-06-06T09:51:54.280312+00:00 r1n1 slurmstepd[108548]: Munge 
credential signature plugin loaded
     2019-06-06T09:51:54.302921+00:00 r1n1 slurmstepd[108548]: 
task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB 
memsw.limit=unlimited
     2019-06-06T09:51:54.303592+00:00 r1n1 slurmstepd[108548]: 
task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB 
mem.limit=1024MB memsw.limit=unlimited
     2019-06-06T09:51:54.330322+00:00 r1n1 slurmd[108517]: Launching 
batch job 11057 for UID 2000
     2019-06-06T09:51:54.353196+00:00 r1n1 slurmstepd[108556]: task 
affinity plugin loaded with CPU mask 
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
     2019-06-06T09:51:54.353899+00:00 r1n1 slurmstepd[108556]: Munge 
credential signature plugin loaded
     2019-06-06T09:51:54.366478+00:00 r1n1 slurmstepd[108556]: 
task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB 
memsw.limit=unlimited
     2019-06-06T09:51:54.366755+00:00 r1n1 slurmstepd[108556]: 
task/cgroup: /slurm/uid_2000/job_11057/step_batch: alloc=1024MB 
mem.limit=1024MB memsw.limit=unlimited
     2019-06-06T09:51:54.393325+00:00 r1n1 slurmstepd[108556]: debug 
level = 2
     2019-06-06T09:51:54.393754+00:00 r1n1 slurmstepd[108556]: starting 
1 tasks
     2019-06-06T09:51:54.401243+00:00 r1n1 slurmstepd[108556]: task 0 
(108561) started 2019-06-06T09:51:54
     2019-06-06T09:51:54.416396+00:00 r1n1 slurmstepd[108561]: 
task_p_pre_launch: Using sched_affinity for tasks
     2019-06-06T09:51:56.514908+00:00 r1n1 slurmstepd[108556]: task 0 
(108561) exited with exit code 1.
     2019-06-06T09:51:56.554430+00:00 r1n1 slurmstepd[108556]: job 11057 
completed with slurm_rc = 0, job_rc = 256
     2019-06-06T09:51:56.554847+00:00 r1n1 slurmstepd[108556]: sending 
REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
     2019-06-06T09:51:56.559856+00:00 r1n1 slurmstepd[108556]: done with job
     2019-06-06T09:51:56.596762+00:00 r1n1 slurmstepd[108548]: Sent 
signal 18 to 11057.4294967295
     2019-06-06T09:51:56.598072+00:00 r1n1 slurmstepd[108548]: Sent 
signal 15 to 11057.4294967295
     2019-06-06T09:51:56.599141+00:00 r1n1 slurmstepd[108548]: 
_oom_event_monitor: oom-kill event count: 1
     2019-06-06T09:51:56.641170+00:00 r1n1 slurmstepd[108548]: done with job

Secondary node (slurmd log):
     2019-06-06T09:51:54.256047+00:00 r1n2 slurmd[84916]: _run_prolog: 
run job script took usec=7
     2019-06-06T09:51:54.256432+00:00 r1n2 slurmd[84916]: _run_prolog: 
prolog with lock for job 11057 ran for 0 seconds
     2019-06-06T09:51:54.279763+00:00 r1n2 slurmstepd[84954]: task 
affinity plugin loaded with CPU mask 
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
     2019-06-06T09:51:54.280448+00:00 r1n2 slurmstepd[84954]: Munge 
credential signature plugin loaded
     2019-06-06T09:51:54.313852+00:00 r1n2 slurmstepd[84954]: 
task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB 
memsw.limit=unlimited
     2019-06-06T09:51:54.314502+00:00 r1n2 slurmstepd[84954]: 
task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB 
mem.limit=1024MB memsw.limit=unlimited
     2019-06-06T09:51:56.564758+00:00 r1n2 slurmstepd[84954]: Sent 
signal 18 to 11057.4294967295
     2019-06-06T09:51:56.608028+00:00 r1n2 slurmstepd[84954]: Sent 
signal 15 to 11057.4294967295
     2019-06-06T09:51:56.609259+00:00 r1n2 slurmstepd[84954]: 
_oom_event_monitor: oom-kill event count: 1
     2019-06-06T09:51:56.638334+00:00 r1n2 slurmstepd[84954]: done with job

Thank you very much again.

-- 
  Andrés Marín Díaz
  
  Servicio de Infraestructura e Innovación
  Universidad Politécnica de Madrid
  
  Centro de Supercomputación y Visualización de Madrid (CeSViMa)
  Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES)
  amarin at cesvima.upm.es | tel 910679676
  
  www.cesvima.upm.es | www.twitter.com/cesvima | www.fb.com/cesvima
   



