<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<div dir="ltr">
<div>Hi Andrés,</div>
<div><br>
</div>
<div>Did you recompile OpenMPI after updating to SLURM 19.05?</div>
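<div><br>
</div>
<div>If not, it is worth checking what the installed OpenMPI was configured against and, if necessary, rebuilding it against the new SLURM's PMI libraries. A rough sketch is below; the install prefix and PMI paths are only examples, so adjust them for your site:</div>
<div><br>
</div>
<div># see which PMI types the updated SLURM offers<br>
srun --mpi=list<br>
<br>
# see what SLURM/PMI support the current OpenMPI build has<br>
ompi_info | grep -i -E 'slurm|pmi'<br>
<br>
# rebuild OpenMPI against the updated SLURM (paths are examples only)<br>
./configure --prefix=/opt/openmpi --with-slurm --with-pmi=/usr --with-pmi-libdir=/usr/lib64<br>
make -j 8 all<br>
make install</div>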
<div><br>
</div>
<div>Sean<br>
</div>
<div>
<div>
<div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<div dir="ltr">--</div>
<div dir="ltr">Sean Crosby</div>
<div dir="ltr">
<div dir="ltr">Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services</div>
<div>Research Computing | CoEPP | School of Physics<br>
</div>
</div>
<div dir="ltr"><span style="font-size:12.8px">University of Melbourne</span><br>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, 6 Jun 2019 at 20:11, Andrés Marín Díaz <<a href="mailto:amarin@cesvima.upm.es">amarin@cesvima.upm.es</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Thank you very much for the help; here is some updated information.<br>
<br>
- If we use Intel MPI (IMPI) mpirun, it works correctly.<br>
- If we use mpirun without going through the scheduler, it works correctly.<br>
- If we use srun with software compiled against OpenMPI, it works correctly.<br>
- If we use SLURM 18.08.6, it works correctly.<br>
- If we use SLURM 19.05.0 and run mpirun inside the sbatch script (a minimal reproducer is sketched after the error message), then we get the error:<br>
--------------------------------------------------------------------------<br>
An ORTE daemon has unexpectedly failed after launch and before<br>
communicating back to mpirun. This could be caused by a number<br>
of factors, including an inability to create a connection back<br>
to mpirun due to lack of common network interfaces and / or no<br>
route found between them. Please check network connectivity<br>
(including firewalls and network routing requirements).<br>
--------------------------------------------------------------------------<br>
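<br>
For reference, the failing job is essentially a batch script of this minimal form (the module name and binary are only illustrative):<br>
<br>
#!/bin/bash<br>
#SBATCH --nodes=2<br>
#SBATCH --ntasks-per-node=1<br>
#SBATCH --mem=1024<br>
module load openmpi        # illustrative module name<br>
mpirun ./hello_mpi         # any binary built with OpenMPI fails the same way<br>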
<br>
Trying to trace the problem:<br>
- mpirun is a binary, so it cannot be traced with bash -x.<br>
- I have run "strace mpirun hostname" to see if it helps, but I am not able to see where the problem may be.<br>
<br>
Here is the output from the strace:<br>
<a href="https://cloud.cesvima.upm.es/index.php/s/hWQMkwU5zW7J8RW" rel="noreferrer" target="_blank">https://cloud.cesvima.upm.es/index.php/s/hWQMkwU5zW7J8RW</a><br>
<br>
And here is the slurmd log at verbose level 5:<br>
Main node (slurmd log):<br>
2019-06-06T09:51:54.255743+00:00 r1n1 slurmd[108517]: _run_prolog: run job script took usec=7<br>
2019-06-06T09:51:54.256118+00:00 r1n1 slurmd[108517]: _run_prolog: prolog with lock for job 11057 ran for 0 seconds<br>
2019-06-06T09:51:54.258887+00:00 r1n1 slurmd[108517]: task_p_slurmd_batch_request: 11057<br>
2019-06-06T09:51:54.259317+00:00 r1n1 slurmd[108517]: task/affinity: job 11057 CPU input mask for node: 0x0000000001<br>
2019-06-06T09:51:54.259680+00:00 r1n1 slurmd[108517]: task/affinity: job 11057 CPU final HW mask for node: 0x0000000001<br>
2019-06-06T09:51:54.279614+00:00 r1n1 slurmstepd[108548]: task affinity plugin loaded with CPU mask<br>
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff<br>
2019-06-06T09:51:54.280312+00:00 r1n1 slurmstepd[108548]: Munge credential signature plugin loaded<br>
2019-06-06T09:51:54.302921+00:00 r1n1 slurmstepd[108548]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited<br>
2019-06-06T09:51:54.303592+00:00 r1n1 slurmstepd[108548]: task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited<br>
2019-06-06T09:51:54.330322+00:00 r1n1 slurmd[108517]: Launching batch job 11057 for UID 2000<br>
2019-06-06T09:51:54.353196+00:00 r1n1 slurmstepd[108556]: task affinity plugin loaded with CPU mask<br>
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff<br>
2019-06-06T09:51:54.353899+00:00 r1n1 slurmstepd[108556]: Munge credential signature plugin loaded<br>
2019-06-06T09:51:54.366478+00:00 r1n1 slurmstepd[108556]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited<br>
2019-06-06T09:51:54.366755+00:00 r1n1 slurmstepd[108556]: task/cgroup: /slurm/uid_2000/job_11057/step_batch: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited<br>
2019-06-06T09:51:54.393325+00:00 r1n1 slurmstepd[108556]: debug level = 2<br>
2019-06-06T09:51:54.393754+00:00 r1n1 slurmstepd[108556]: starting 1 tasks<br>
2019-06-06T09:51:54.401243+00:00 r1n1 slurmstepd[108556]: task 0 (108561) started 2019-06-06T09:51:54<br>
2019-06-06T09:51:54.416396+00:00 r1n1 slurmstepd[108561]: task_p_pre_launch: Using sched_affinity for tasks<br>
2019-06-06T09:51:56.514908+00:00 r1n1 slurmstepd[108556]: task 0 (108561) exited with exit code 1.<br>
2019-06-06T09:51:56.554430+00:00 r1n1 slurmstepd[108556]: job 11057 completed with slurm_rc = 0, job_rc = 256<br>
2019-06-06T09:51:56.554847+00:00 r1n1 slurmstepd[108556]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256<br>
2019-06-06T09:51:56.559856+00:00 r1n1 slurmstepd[108556]: done with job<br>
2019-06-06T09:51:56.596762+00:00 r1n1 slurmstepd[108548]: Sent signal 18 to 11057.4294967295<br>
2019-06-06T09:51:56.598072+00:00 r1n1 slurmstepd[108548]: Sent signal 15 to 11057.4294967295<br>
2019-06-06T09:51:56.599141+00:00 r1n1 slurmstepd[108548]: _oom_event_monitor: oom-kill event count: 1<br>
2019-06-06T09:51:56.641170+00:00 r1n1 slurmstepd[108548]: done with job<br>
<br>
Secondary node (slurmd log):<br>
2019-06-06T09:51:54.256047+00:00 r1n2 slurmd[84916]: _run_prolog: run job script took usec=7<br>
2019-06-06T09:51:54.256432+00:00 r1n2 slurmd[84916]: _run_prolog: prolog with lock for job 11057 ran for 0 seconds<br>
2019-06-06T09:51:54.279763+00:00 r1n2 slurmstepd[84954]: task affinity plugin loaded with CPU mask<br>
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff<br>
2019-06-06T09:51:54.280448+00:00 r1n2 slurmstepd[84954]: Munge credential signature plugin loaded<br>
2019-06-06T09:51:54.313852+00:00 r1n2 slurmstepd[84954]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited<br>
2019-06-06T09:51:54.314502+00:00 r1n2 slurmstepd[84954]: task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited<br>
2019-06-06T09:51:56.564758+00:00 r1n2 slurmstepd[84954]: Sent signal 18 to 11057.4294967295<br>
2019-06-06T09:51:56.608028+00:00 r1n2 slurmstepd[84954]: Sent signal 15 to 11057.4294967295<br>
2019-06-06T09:51:56.609259+00:00 r1n2 slurmstepd[84954]: _oom_event_monitor: oom-kill event count: 1<br>
2019-06-06T09:51:56.638334+00:00 r1n2 slurmstepd[84954]: done with job<br>
<br>
Thank you very much again.<br>
<br>
-- <br>
Andrés Marín Díaz<br>
<br>
Servicio de Infraestructura e Innovación<br>
Universidad Politécnica de Madrid<br>
<br>
Centro de Supercomputación y Visualización de Madrid (CeSViMa)<br>
Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES)<br>
<a href="mailto:amarin@cesvima.upm.es" target="_blank">amarin@cesvima.upm.es</a> | tel 910679676<br>
<br>
<a href="http://www.cesvima.upm.es" rel="noreferrer" target="_blank">www.cesvima.upm.es</a> |
<a href="http://www.twitter.com/cesvima" rel="noreferrer" target="_blank">www.twitter.com/cesvima</a> |
<a href="http://www.fb.com/cesvima" rel="noreferrer" target="_blank">www.fb.com/cesvima</a><br>
<br>
<br>
<br>
</blockquote>
</div>
</body>
</html>