<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hello,<br>
<br>
We have tried to compile it in two ways. Initially we
compiled it with PMIx as follows:<br>
rpmbuild -ta slurm-19.05.0.tar.bz2 --define '_with_pmix
--with-pmix=/opt/pmix/3.1.2/'<br>
<br>
But we have also tried compiling it without pmix:<br>
rpmbuild -ta slurm-19.05.0.tar.bz2<br>
<br>
In both cases the result is the same.<br>
<br>
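As a quick check of which MPI plugin types a given build
actually provides, they can be listed with:<br>
srun --mpi=list<br>
If the PMIx build worked, we would expect something like pmix
(and a versioned pmix_v3) to appear alongside pmi2 and none.<br>
<br>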
In slurm.conf we have defined:<br>
MpiDefault=pmi2<br>
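With the PMIx build, an alternative for a single job would be
to launch it with, e.g.:<br>
srun --mpi=pmix ./my_app<br>
(./my_app is just a placeholder); the --mpi option only affects
steps launched by srun, it has no effect on mpirun itself.<br>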
<br>
Thank you,<br>
Best regards.<br>
</p>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">El 06/06/2019 a las 13:27, Sean Crosby
escribió:<br>
</div>
<blockquote type="cite"
cite="mid:CADg9O+N2B=TnwiJaaeH_vf+c1=vND-0hioE-BR5BGeNcdCL2=Q@mail.gmail.com">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<div dir="ltr">
<div>How did you compile SLURM? Did you add the contribs/pmi
and/or contribs/pmi2 plugins to the install? Or did you use
PMIx?</div>
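<div>(For reference, those client libraries are normally
built from the SLURM source tree after the top-level
./configure, roughly:<br>
cd contribs/pmi2<br>
make install<br>
so that the installed libpmi2 matches the SLURM version in
use.)</div>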
<div><br>
</div>
<div>Sean<br>
</div>
<div>
<div>
<div dir="ltr" class="gmail_signature"
data-smartmail="gmail_signature">
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<div dir="ltr">--</div>
<div dir="ltr">Sean Crosby</div>
<div dir="ltr">
<div dir="ltr">Senior DevOpsHPC Engineer
and HPC Team Lead | Research Platform
Services</div>
<div>Research Computing | CoEPP | School
of Physics<br>
</div>
</div>
<div dir="ltr"><span
style="font-size:12.8px">University of
Melbourne</span><br>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, 6 Jun 2019 at 21:11,
Andrés Marín Díaz <<a href="mailto:amarin@cesvima.upm.es"
moz-do-not-send="true">amarin@cesvima.upm.es</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>Hello,<br>
<br>
Yes, we have recompiled OpenMPI with SLURM 19.05
integration, but the problem remains.<br>
<br>
We have also tried recompiling OpenMPI without SLURM
integration. In that case executions fail with srun, while
mpirun still works under SLURM 18.08 and fails under 19.05
in the same way.<br>
<br>
Thank you very much once more.</p>
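<p>For context, an OpenMPI build with SLURM integration is
typically configured along these lines (the prefix, paths
and exact flags here are only illustrative):<br>
./configure --prefix=/opt/openmpi --with-slurm
--with-pmix=/opt/pmix/3.1.2 --with-pmi<br>
i.e. SLURM support in the launcher plus external PMIx and
the SLURM PMI/PMI2 client libraries.</p>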
<p><br>
</p>
<div class="gmail-m_339349209111126957moz-cite-prefix">El
06/06/2019 a las 12:46, Sean Crosby escribió:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Hi Andrés,</div>
<div><br>
</div>
<div>Did you recompile OpenMPI after updating to SLURM
19.05?</div>
<div><br>
</div>
<div>Sean<br>
</div>
<div>
<div>
<div dir="ltr"
class="gmail-m_339349209111126957gmail_signature">
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<div dir="ltr">--</div>
<div dir="ltr">Sean Crosby</div>
<div dir="ltr">
<div dir="ltr">Senior DevOpsHPC
Engineer and HPC Team Lead |
Research Platform Services</div>
<div>Research Computing | CoEPP |
School of Physics<br>
</div>
</div>
<div dir="ltr"><span
style="font-size:12.8px">University
of Melbourne</span><br>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, 6 Jun 2019 at
20:11, Andrés Marín Díaz <<a
href="mailto:amarin@cesvima.upm.es" target="_blank"
moz-do-not-send="true">amarin@cesvima.upm.es</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px
0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
Thank you very much for the help; here is some updated
information.<br>
<br>
- If we use Intel MPI (IMPI) mpirun it works
correctly.<br>
- If we use mpirun without using the scheduler it
works correctly.<br>
- If we use srun with software compiled with OpenMPI
it works correctly.<br>
- If we use SLURM 18.08.6 it works correctly.<br>
- If we use SLURM 19.05.0 and mpirun inside the sbatch
script then we <br>
get the error:<br>
--------------------------------------------------------------------------<br>
An ORTE daemon has unexpectedly failed after
launch and before<br>
communicating back to mpirun. This could be
caused by a number<br>
of factors, including an inability to create a
connection back<br>
to mpirun due to lack of common network
interfaces and/or no<br>
route found between them. Please check network
connectivity<br>
(including firewalls and network routing
requirements).<br>
--------------------------------------------------------------------------<br>
<br>
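For reference, the failing case is essentially a batch
script of this shape (node count, memory and the binary
name are placeholders, not our exact job):<br>
#!/bin/bash<br>
#SBATCH --nodes=2<br>
#SBATCH --ntasks-per-node=1<br>
#SBATCH --mem=1024<br>
# hello_mpi stands for any small test program compiled
with this OpenMPI<br>
mpirun ./hello_mpi<br>
<br>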
Trying to trace the problem:<br>
- mpirun is a binary and cannot be traced with bash
-x.<br>
- I've done a "strace mpirun hostname" to see if it
helps, but I am not <br>
able to see where the problem may be.<br>
<br>
Here is the output from the strace:<br>
<a
href="https://cloud.cesvima.upm.es/index.php/s/hWQMkwU5zW7J8RW"
rel="noreferrer" target="_blank"
moz-do-not-send="true">https://cloud.cesvima.upm.es/index.php/s/hWQMkwU5zW7J8RW</a><br>
<br>
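If needed, we can redo the trace separating the processes
that mpirun spawns (assuming a reasonably recent strace):<br>
strace -ff -o mpirun.trace mpirun hostname<br>
which writes one mpirun.trace.&lt;pid&gt; file per child
process, so the step that tries to start the remote daemons
can be inspected on its own.<br>
<br>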
And here is the slurmd log with verbose level 5:<br>
Main node (slurmd log):<br>
2019-06-06T09:51:54.255743+00:00 r1n1
slurmd[108517]: _run_prolog: <br>
run job script took usec=7<br>
2019-06-06T09:51:54.256118+00:00 r1n1
slurmd[108517]: _run_prolog: <br>
prolog with lock for job 11057 ran for 0 seconds<br>
2019-06-06T09:51:54.258887+00:00 r1n1
slurmd[108517]: <br>
task_p_slurmd_batch_request: 11057<br>
2019-06-06T09:51:54.259317+00:00 r1n1
slurmd[108517]: <br>
task/affinity: job 11057 CPU input mask for node:
0x0000000001<br>
2019-06-06T09:51:54.259680+00:00 r1n1
slurmd[108517]: <br>
task/affinity: job 11057 CPU final HW mask for node:
0x0000000001<br>
2019-06-06T09:51:54.279614+00:00 r1n1
slurmstepd[108548]: task <br>
affinity plugin loaded with CPU mask <br>
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff<br>
2019-06-06T09:51:54.280312+00:00 r1n1
slurmstepd[108548]: Munge <br>
credential signature plugin loaded<br>
2019-06-06T09:51:54.302921+00:00 r1n1
slurmstepd[108548]: <br>
task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB
mem.limit=1024MB <br>
memsw.limit=unlimited<br>
2019-06-06T09:51:54.303592+00:00 r1n1
slurmstepd[108548]: <br>
task/cgroup: /slurm/uid_2000/job_11057/step_extern:
alloc=1024MB <br>
mem.limit=1024MB memsw.limit=unlimited<br>
2019-06-06T09:51:54.330322+00:00 r1n1
slurmd[108517]: Launching <br>
batch job 11057 for UID 2000<br>
2019-06-06T09:51:54.353196+00:00 r1n1
slurmstepd[108556]: task <br>
affinity plugin loaded with CPU mask <br>
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff<br>
2019-06-06T09:51:54.353899+00:00 r1n1
slurmstepd[108556]: Munge <br>
credential signature plugin loaded<br>
2019-06-06T09:51:54.366478+00:00 r1n1
slurmstepd[108556]: <br>
task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB
mem.limit=1024MB <br>
memsw.limit=unlimited<br>
2019-06-06T09:51:54.366755+00:00 r1n1
slurmstepd[108556]: <br>
task/cgroup: /slurm/uid_2000/job_11057/step_batch:
alloc=1024MB <br>
mem.limit=1024MB memsw.limit=unlimited<br>
2019-06-06T09:51:54.393325+00:00 r1n1
slurmstepd[108556]: debug <br>
level = 2<br>
2019-06-06T09:51:54.393754+00:00 r1n1
slurmstepd[108556]: starting <br>
1 tasks<br>
2019-06-06T09:51:54.401243+00:00 r1n1
slurmstepd[108556]: task 0 <br>
(108561) started 2019-06-06T09:51:54<br>
2019-06-06T09:51:54.416396+00:00 r1n1
slurmstepd[108561]: <br>
task_p_pre_launch: Using sched_affinity for tasks<br>
2019-06-06T09:51:56.514908+00:00 r1n1
slurmstepd[108556]: task 0 <br>
(108561) exited with exit code 1.<br>
2019-06-06T09:51:56.554430+00:00 r1n1
slurmstepd[108556]: job 11057 <br>
completed with slurm_rc = 0, job_rc = 256<br>
2019-06-06T09:51:56.554847+00:00 r1n1
slurmstepd[108556]: sending <br>
REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256<br>
2019-06-06T09:51:56.559856+00:00 r1n1
slurmstepd[108556]: done with job<br>
2019-06-06T09:51:56.596762+00:00 r1n1
slurmstepd[108548]: Sent <br>
signal 18 to 11057.4294967295<br>
2019-06-06T09:51:56.598072+00:00 r1n1
slurmstepd[108548]: Sent <br>
signal 15 to 11057.4294967295<br>
2019-06-06T09:51:56.599141+00:00 r1n1
slurmstepd[108548]: <br>
_oom_event_monitor: oom-kill event count: 1<br>
2019-06-06T09:51:56.641170+00:00 r1n1
slurmstepd[108548]: done with job<br>
<br>
Secondary node (slurmd log):<br>
2019-06-06T09:51:54.256047+00:00 r1n2
slurmd[84916]: _run_prolog: <br>
run job script took usec=7<br>
2019-06-06T09:51:54.256432+00:00 r1n2
slurmd[84916]: _run_prolog: <br>
prolog with lock for job 11057 ran for 0 seconds<br>
2019-06-06T09:51:54.279763+00:00 r1n2
slurmstepd[84954]: task <br>
affinity plugin loaded with CPU mask <br>
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff<br>
2019-06-06T09:51:54.280448+00:00 r1n2
slurmstepd[84954]: Munge <br>
credential signature plugin loaded<br>
2019-06-06T09:51:54.313852+00:00 r1n2
slurmstepd[84954]: <br>
task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB
mem.limit=1024MB <br>
memsw.limit=unlimited<br>
2019-06-06T09:51:54.314502+00:00 r1n2
slurmstepd[84954]: <br>
task/cgroup: /slurm/uid_2000/job_11057/step_extern:
alloc=1024MB <br>
mem.limit=1024MB memsw.limit=unlimited<br>
2019-06-06T09:51:56.564758+00:00 r1n2
slurmstepd[84954]: Sent <br>
signal 18 to 11057.4294967295<br>
2019-06-06T09:51:56.608028+00:00 r1n2
slurmstepd[84954]: Sent <br>
signal 15 to 11057.4294967295<br>
2019-06-06T09:51:56.609259+00:00 r1n2
slurmstepd[84954]: <br>
_oom_event_monitor: oom-kill event count: 1<br>
2019-06-06T09:51:56.638334+00:00 r1n2
slurmstepd[84954]: done with job<br>
<br>
Thank you very much again.<br>
<br>
-- <br>
Andrés Marín Díaz<br>
<br>
Servicio de Infraestructura e Innovación<br>
Universidad Politécnica de Madrid<br>
<br>
Centro de Supercomputación y Visualización de Madrid
(CeSViMa)<br>
Campus de Montegancedo. 28223, Pozuelo de Alarcón,
Madrid (ES)<br>
<a href="mailto:amarin@cesvima.upm.es"
target="_blank" moz-do-not-send="true">amarin@cesvima.upm.es</a>
| tel 910679676<br>
<br>
<a href="http://www.cesvima.upm.es"
rel="noreferrer" target="_blank"
moz-do-not-send="true">www.cesvima.upm.es</a> |
<a href="http://www.twitter.com/cesvima"
rel="noreferrer" target="_blank"
moz-do-not-send="true">www.twitter.com/cesvima</a> |
<a href="http://www.fb.com/cesvima" rel="noreferrer"
target="_blank" moz-do-not-send="true">www.fb.com/cesvima</a><br>
<br>
<br>
<br>
</blockquote>
</div>
</blockquote>
<pre class="gmail-m_339349209111126957moz-signature" cols="72">--
Andrés Marín Díaz
Servicio de Infraestructura e Innovación
Universidad Politécnica de Madrid
Centro de Supercomputación y Visualización de Madrid (CeSViMa)
Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES)
<a class="gmail-m_339349209111126957moz-txt-link-abbreviated" href="mailto:amarin@cesvima.upm.es" target="_blank" moz-do-not-send="true">amarin@cesvima.upm.es</a> | tel 910679676
<a class="gmail-m_339349209111126957moz-txt-link-abbreviated" href="http://www.cesvima.upm.es" target="_blank" moz-do-not-send="true">www.cesvima.upm.es</a> | <a class="gmail-m_339349209111126957moz-txt-link-abbreviated" href="http://www.twitter.com/cesvima" target="_blank" moz-do-not-send="true">www.twitter.com/cesvima</a> | <a class="gmail-m_339349209111126957moz-txt-link-abbreviated" href="http://www.fb.com/cesvima" target="_blank" moz-do-not-send="true">www.fb.com/cesvima</a>
</pre>
</div>
</blockquote>
</div>
</blockquote>
<pre class="moz-signature" cols="72">--
Andrés Marín Díaz
Servicio de Infraestructura e Innovación
Universidad Politécnica de Madrid
Centro de Supercomputación y Visualización de Madrid (CeSViMa)
Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES)
<a class="moz-txt-link-abbreviated" href="mailto:amarin@cesvima.upm.es">amarin@cesvima.upm.es</a> | tel 910679676
<a class="moz-txt-link-abbreviated" href="http://www.cesvima.upm.es">www.cesvima.upm.es</a> | <a class="moz-txt-link-abbreviated" href="http://www.twitter.com/cesvima">www.twitter.com/cesvima</a> | <a class="moz-txt-link-abbreviated" href="http://www.fb.com/cesvima">www.fb.com/cesvima</a>
</pre>
</body>
</html>