[slurm-users] Failed to launch jobs with mpirun after upgrading to Slurm 19.05
Sean Crosby
scrosby at unimelb.edu.au
Thu Jun 6 10:46:02 UTC 2019
Hi Andrés,
Did you recompile OpenMPI after updating to SLURM 19.05?
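If OpenMPI was built with Slurm's PMI/PMIx support it typically links against the Slurm libraries, so an mpirun built for 18.08 can fail to launch its ORTE daemons after the upgrade. As a rough sketch (the prefix, version and PMIx path below are only placeholders for whatever your site uses):

    # show how the current OpenMPI was configured
    ompi_info | grep -i 'configure command'

    # rebuild against the upgraded Slurm
    ./configure --prefix=/opt/openmpi/3.1.4 --with-slurm --with-pmix=/usr
    make -j 8 && make install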
Sean
--
Sean Crosby
Senior DevOps/HPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of Physics
University of Melbourne
On Thu, 6 Jun 2019 at 20:11, Andrés Marín Díaz <amarin at cesvima.upm.es> wrote:
Thank you very much for the help. Here is some updated information.
- If we use Intel MPI (IMPI) mpirun, it works correctly.
- If we use mpirun without going through the scheduler, it works correctly.
- If we use srun with software compiled with OpenMPI, it works correctly.
- If we use SLURM 18.08.6, it works correctly.
- If we use SLURM 19.05.0 and call mpirun inside the sbatch script (a minimal script is sketched after the error message below), we get the error:
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to lack of common network interfaces and / or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
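In case it helps, the failing case boils down to a minimal sbatch script like the following (module and binary names are simplified placeholders for what we actually run; the resources match the job in the logs below):

    #!/bin/bash
    #SBATCH --job-name=mpi_test
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1
    #SBATCH --mem=1024M

    module load openmpi      # OpenMPI build from before the upgrade
    mpirun ./hello_mpi       # fails with the ORTE error above
    # srun ./hello_mpi       # the same binary launches fine with srun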
Trying to trace the problem:
- mpirun is a binary and cannot be traced with bash -x.
- I have run "strace mpirun hostname" to see if it helps, but I am not able to see where the problem lies.
Here is the output from the strace:
https://cloud.cesvima.upm.es/index.php/s/hWQMkwU5zW7J8RW
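A quick way to see how this OpenMPI was built against Slurm/PMI, in case that turns out to be relevant, is something along these lines (just a sketch):

    ompi_info | grep -Ei 'slurm|pmi'
    srun --mpi=list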
And here are the slurmd logs with verbose level 5:
Main node (slurmd log):
2019-06-06T09:51:54.255743+00:00 r1n1 slurmd[108517]: _run_prolog: run job script took usec=7
2019-06-06T09:51:54.256118+00:00 r1n1 slurmd[108517]: _run_prolog: prolog with lock for job 11057 ran for 0 seconds
2019-06-06T09:51:54.258887+00:00 r1n1 slurmd[108517]: task_p_slurmd_batch_request: 11057
2019-06-06T09:51:54.259317+00:00 r1n1 slurmd[108517]: task/affinity: job 11057 CPU input mask for node: 0x0000000001
2019-06-06T09:51:54.259680+00:00 r1n1 slurmd[108517]: task/affinity: job 11057 CPU final HW mask for node: 0x0000000001
2019-06-06T09:51:54.279614+00:00 r1n1 slurmstepd[108548]: task affinity plugin loaded with CPU mask
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
2019-06-06T09:51:54.280312+00:00 r1n1 slurmstepd[108548]: Munge credential signature plugin loaded
2019-06-06T09:51:54.302921+00:00 r1n1 slurmstepd[108548]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:54.303592+00:00 r1n1 slurmstepd[108548]: task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:54.330322+00:00 r1n1 slurmd[108517]: Launching batch job 11057 for UID 2000
2019-06-06T09:51:54.353196+00:00 r1n1 slurmstepd[108556]: task affinity plugin loaded with CPU mask
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
2019-06-06T09:51:54.353899+00:00 r1n1 slurmstepd[108556]: Munge credential signature plugin loaded
2019-06-06T09:51:54.366478+00:00 r1n1 slurmstepd[108556]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:54.366755+00:00 r1n1 slurmstepd[108556]: task/cgroup: /slurm/uid_2000/job_11057/step_batch: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:54.393325+00:00 r1n1 slurmstepd[108556]: debug level = 2
2019-06-06T09:51:54.393754+00:00 r1n1 slurmstepd[108556]: starting 1 tasks
2019-06-06T09:51:54.401243+00:00 r1n1 slurmstepd[108556]: task 0 (108561) started 2019-06-06T09:51:54
2019-06-06T09:51:54.416396+00:00 r1n1 slurmstepd[108561]: task_p_pre_launch: Using sched_affinity for tasks
2019-06-06T09:51:56.514908+00:00 r1n1 slurmstepd[108556]: task 0 (108561) exited with exit code 1.
2019-06-06T09:51:56.554430+00:00 r1n1 slurmstepd[108556]: job 11057 completed with slurm_rc = 0, job_rc = 256
2019-06-06T09:51:56.554847+00:00 r1n1 slurmstepd[108556]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
2019-06-06T09:51:56.559856+00:00 r1n1 slurmstepd[108556]: done with job
2019-06-06T09:51:56.596762+00:00 r1n1 slurmstepd[108548]: Sent signal 18 to 11057.4294967295
2019-06-06T09:51:56.598072+00:00 r1n1 slurmstepd[108548]: Sent signal 15 to 11057.4294967295
2019-06-06T09:51:56.599141+00:00 r1n1 slurmstepd[108548]: _oom_event_monitor: oom-kill event count: 1
2019-06-06T09:51:56.641170+00:00 r1n1 slurmstepd[108548]: done with job
Secondary node (slurmd log):
2019-06-06T09:51:54.256047+00:00 r1n2 slurmd[84916]: _run_prolog: run job script took usec=7
2019-06-06T09:51:54.256432+00:00 r1n2 slurmd[84916]: _run_prolog: prolog with lock for job 11057 ran for 0 seconds
2019-06-06T09:51:54.279763+00:00 r1n2 slurmstepd[84954]: task affinity plugin loaded with CPU mask
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
2019-06-06T09:51:54.280448+00:00 r1n2 slurmstepd[84954]: Munge credential signature plugin loaded
2019-06-06T09:51:54.313852+00:00 r1n2 slurmstepd[84954]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:54.314502+00:00 r1n2 slurmstepd[84954]: task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
2019-06-06T09:51:56.564758+00:00 r1n2 slurmstepd[84954]: Sent signal 18 to 11057.4294967295
2019-06-06T09:51:56.608028+00:00 r1n2 slurmstepd[84954]: Sent signal 15 to 11057.4294967295
2019-06-06T09:51:56.609259+00:00 r1n2 slurmstepd[84954]: _oom_event_monitor: oom-kill event count: 1
2019-06-06T09:51:56.638334+00:00 r1n2 slurmstepd[84954]: done with job
Thank you very much again.
--
Andrés Marín Díaz
Servicio de Infraestructura e Innovación
Universidad Politécnica de Madrid
Centro de Supercomputación y Visualización de Madrid (CeSViMa)
Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES)
amarin at cesvima.upm.es | tel 910679676
www.cesvima.upm.es | www.twitter.com/cesvima | www.fb.com/cesvima