<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Consolas;
panose-1:2 11 6 9 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman",serif;
color:black;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
p
{mso-style-priority:99;
mso-margin-top-alt:auto;
margin-right:0in;
mso-margin-bottom-alt:auto;
margin-left:0in;
font-size:12.0pt;
font-family:"Times New Roman",serif;
color:black;}
pre
{mso-style-priority:99;
mso-style-link:"HTML Preformatted Char";
margin:0in;
margin-bottom:.0001pt;
font-size:10.0pt;
font-family:"Courier New";
color:black;}
p.msonormal0, li.msonormal0, div.msonormal0
{mso-style-name:msonormal;
mso-margin-top-alt:auto;
margin-right:0in;
mso-margin-bottom-alt:auto;
margin-left:0in;
font-size:12.0pt;
font-family:"Times New Roman",serif;
color:black;}
span.HTMLPreformattedChar
{mso-style-name:"HTML Preformatted Char";
mso-style-priority:99;
mso-style-link:"HTML Preformatted";
font-family:Consolas;
color:black;}
span.EmailStyle21
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body bgcolor="white" lang="EN-US" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Does anyone know what would happen to running jobs if we switch to cgroups? We missed setting this during our last general cluster shutdown and would like to enable it now,
but we do have running jobs at the moment. Thanks<o:p></o:p></span></p>
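<p class="MsoNormal">For context, the change we have in mind is roughly the following (a sketch only; the option names are the standard slurm.conf/cgroup.conf ones, but the exact constraint values would follow our site config):<o:p></o:p></p>
<pre># slurm.conf -- switch process tracking and task management to cgroups
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf -- illustrative constraints, not final values
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes</pre>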
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>
<div>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Deborah Crocker, PhD<br>
Systems Engineer III <br>
Office of Information Technology <br>
The University of Alabama<br>
Box 870346 <br>
Tuscaloosa, AL 36587 <br>
Office 205-348-3758 | Fax 205-348-9393 <br>
<a href="mailto:deborah.crocker@ua.edu"><span style="color:#0563C1">deborah.crocker@ua.edu</span></a><o:p></o:p></span></p>
</div>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>
<div>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:windowtext">From:</span></b><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:windowtext"> slurm-users <slurm-users-bounces@lists.schedmd.com>
<b>On Behalf Of </b>Andrés Marín Díaz<br>
<b>Sent:</b> Thursday, June 6, 2019 8:58 AM<br>
<b>To:</b> slurm-users@lists.schedmd.com<br>
<b>Subject:</b> Re: [slurm-users] Failed to launch jobs with mpirun after upgrading to Slurm 19.05<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p>Hello,<br>
<br>
We have tried to compile it in two ways. Initially we compiled it with PMIx support as follows:<br>
rpmbuild -ta slurm-19.05.0.tar.bz2 --define '_with_pmix --with-pmix=/opt/pmix/3.1.2/'<br>
<br>
But we have also tried compiling it without PMIx:<br>
rpmbuild -ta slurm-19.05.0.tar.bz2<br>
<br>
In both cases the result is the same.<br>
<br>
In slurm.conf we have defined:<br>
MpiDefault=pmi2<br>
<br>
Thank you,<br>
Best regards.<o:p></o:p></p>
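<p>To check which MPI plugin types the installed build actually provides, one option (run on a cluster node; the list varies by build) is:<o:p></o:p></p>
<pre># The list should include pmix if the --with-pmix build took effect
srun --mpi=list</pre>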
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">On 06/06/2019 at 13:27, Sean Crosby wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<div>
<div>
<p class="MsoNormal">How did you compile SLURM? Did you add the contribs/pmi and/or contribs/pmi2 plugins to the install? Or did you use PMIx?<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Sean<o:p></o:p></p>
</div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">--<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Sean Crosby<o:p></o:p></p>
</div>
<div>
<div>
<p class="MsoNormal">Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Research Computing | CoEPP | School of Physics<o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal"><span style="font-size:9.5pt">University of Melbourne</span><o:p></o:p></p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">On Thu, 6 Jun 2019 at 21:11, Andrés Marín Díaz <<a href="mailto:amarin@cesvima.upm.es">amarin@cesvima.upm.es</a>> wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">
<div>
<p>Hello,<br>
<br>
Yes, we recompiled OpenMPI with SLURM 19.05 integration, but the problem remains.<br>
<br>
We have also tried recompiling OpenMPI without SLURM integration. In that case jobs fail with srun, but with mpirun they still work under SLURM 18.08 and fail under 19.05 in the same way.<br>
<br>
Thank you very much once more.<o:p></o:p></p>
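<p>One way to confirm what the OpenMPI side was built against is ompi_info (a sketch; component names vary by OpenMPI version):<o:p></o:p></p>
<pre># Run where the OpenMPI being used is installed
ompi_info | grep -i slurm   # Slurm launch/allocation components
ompi_info | grep -i pmi     # PMI/PMIx support compiled in</pre>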
<p><o:p> </o:p></p>
<div>
<p class="MsoNormal">On 06/06/2019 at 12:46, Sean Crosby wrote:<o:p></o:p></p>
</div>
<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
<div>
<div>
<p class="MsoNormal">Hi Andrés,<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Did you recompile OpenMPI after updating to SLURM 19.05?<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Sean<o:p></o:p></p>
</div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">--<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Sean Crosby<o:p></o:p></p>
</div>
<div>
<div>
<p class="MsoNormal">Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Research Computing | CoEPP | School of Physics<o:p></o:p></p>
</div>
</div>
<div>
<p class="MsoNormal"><span style="font-size:9.5pt">University of Melbourne</span><o:p></o:p></p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">On Thu, 6 Jun 2019 at 20:11, Andrés Marín Díaz <<a href="mailto:amarin@cesvima.upm.es" target="_blank">amarin@cesvima.upm.es</a>> wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">
<p class="MsoNormal" style="margin-bottom:12.0pt">Thank you very much for the help; here is some updated information.<br>
<br>
- If we use Intel MPI (IMPI) mpirun, it works correctly.<br>
- If we use mpirun without the scheduler, it works correctly.<br>
- If we use srun with software compiled against OpenMPI, it works correctly.<br>
- If we use SLURM 18.08.6, it works correctly.<br>
- If we use SLURM 19.05.0 and run mpirun inside the sbatch script, we get the error:<br>
--------------------------------------------------------------------------<br>
An ORTE daemon has unexpectedly failed after launch and before<br>
communicating back to mpirun. This could be caused by a number<br>
of factors, including an inability to create a connection back<br>
to mpirun due to lack of common network interfaces and/or no<br>
route found between them. Please check network connectivity<br>
(including firewalls and network routing requirements).<br>
--------------------------------------------------------------------------<br>
<br>
Trying to trace the problem:<br>
- mpirun is a binary, so it cannot be traced with bash -x.<br>
- I ran "strace mpirun hostname" to see if it helps, but I am not able to see where the problem may be.<br>
<br>
Here is the output from the strace:<br>
<a href="https://cloud.cesvima.upm.es/index.php/s/hWQMkwU5zW7J8RW" target="_blank">https://cloud.cesvima.upm.es/index.php/s/hWQMkwU5zW7J8RW</a><br>
<br>
And here the slurmd log with verbose level 5:<br>
Main node (slurmd log):<br>
2019-06-06T09:51:54.255743+00:00 r1n1 slurmd[108517]: _run_prolog: <br>
run job script took usec=7<br>
2019-06-06T09:51:54.256118+00:00 r1n1 slurmd[108517]: _run_prolog: <br>
prolog with lock for job 11057 ran for 0 seconds<br>
2019-06-06T09:51:54.258887+00:00 r1n1 slurmd[108517]: <br>
task_p_slurmd_batch_request: 11057<br>
2019-06-06T09:51:54.259317+00:00 r1n1 slurmd[108517]: <br>
task/affinity: job 11057 CPU input mask for node: 0x0000000001<br>
2019-06-06T09:51:54.259680+00:00 r1n1 slurmd[108517]: <br>
task/affinity: job 11057 CPU final HW mask for node: 0x0000000001<br>
2019-06-06T09:51:54.279614+00:00 r1n1 slurmstepd[108548]: task <br>
affinity plugin loaded with CPU mask <br>
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff<br>
2019-06-06T09:51:54.280312+00:00 r1n1 slurmstepd[108548]: Munge <br>
credential signature plugin loaded<br>
2019-06-06T09:51:54.302921+00:00 r1n1 slurmstepd[108548]: <br>
task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB <br>
memsw.limit=unlimited<br>
2019-06-06T09:51:54.303592+00:00 r1n1 slurmstepd[108548]: <br>
task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB <br>
mem.limit=1024MB memsw.limit=unlimited<br>
2019-06-06T09:51:54.330322+00:00 r1n1 slurmd[108517]: Launching <br>
batch job 11057 for UID 2000<br>
2019-06-06T09:51:54.353196+00:00 r1n1 slurmstepd[108556]: task <br>
affinity plugin loaded with CPU mask <br>
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff<br>
2019-06-06T09:51:54.353899+00:00 r1n1 slurmstepd[108556]: Munge <br>
credential signature plugin loaded<br>
2019-06-06T09:51:54.366478+00:00 r1n1 slurmstepd[108556]: <br>
task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB <br>
memsw.limit=unlimited<br>
2019-06-06T09:51:54.366755+00:00 r1n1 slurmstepd[108556]: <br>
task/cgroup: /slurm/uid_2000/job_11057/step_batch: alloc=1024MB <br>
mem.limit=1024MB memsw.limit=unlimited<br>
2019-06-06T09:51:54.393325+00:00 r1n1 slurmstepd[108556]: debug <br>
level = 2<br>
2019-06-06T09:51:54.393754+00:00 r1n1 slurmstepd[108556]: starting <br>
1 tasks<br>
2019-06-06T09:51:54.401243+00:00 r1n1 slurmstepd[108556]: task 0 <br>
(108561) started 2019-06-06T09:51:54<br>
2019-06-06T09:51:54.416396+00:00 r1n1 slurmstepd[108561]: <br>
task_p_pre_launch: Using sched_affinity for tasks<br>
2019-06-06T09:51:56.514908+00:00 r1n1 slurmstepd[108556]: task 0 <br>
(108561) exited with exit code 1.<br>
2019-06-06T09:51:56.554430+00:00 r1n1 slurmstepd[108556]: job 11057 <br>
completed with slurm_rc = 0, job_rc = 256<br>
2019-06-06T09:51:56.554847+00:00 r1n1 slurmstepd[108556]: sending <br>
REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256<br>
2019-06-06T09:51:56.559856+00:00 r1n1 slurmstepd[108556]: done with job<br>
2019-06-06T09:51:56.596762+00:00 r1n1 slurmstepd[108548]: Sent <br>
signal 18 to 11057.4294967295<br>
2019-06-06T09:51:56.598072+00:00 r1n1 slurmstepd[108548]: Sent <br>
signal 15 to 11057.4294967295<br>
2019-06-06T09:51:56.599141+00:00 r1n1 slurmstepd[108548]: <br>
_oom_event_monitor: oom-kill event count: 1<br>
2019-06-06T09:51:56.641170+00:00 r1n1 slurmstepd[108548]: done with job<br>
<br>
Secondary node (slurmd log):<br>
2019-06-06T09:51:54.256047+00:00 r1n2 slurmd[84916]: _run_prolog: <br>
run job script took usec=7<br>
2019-06-06T09:51:54.256432+00:00 r1n2 slurmd[84916]: _run_prolog: <br>
prolog with lock for job 11057 ran for 0 seconds<br>
2019-06-06T09:51:54.279763+00:00 r1n2 slurmstepd[84954]: task <br>
affinity plugin loaded with CPU mask <br>
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff<br>
2019-06-06T09:51:54.280448+00:00 r1n2 slurmstepd[84954]: Munge <br>
credential signature plugin loaded<br>
2019-06-06T09:51:54.313852+00:00 r1n2 slurmstepd[84954]: <br>
task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB <br>
memsw.limit=unlimited<br>
2019-06-06T09:51:54.314502+00:00 r1n2 slurmstepd[84954]: <br>
task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB <br>
mem.limit=1024MB memsw.limit=unlimited<br>
2019-06-06T09:51:56.564758+00:00 r1n2 slurmstepd[84954]: Sent <br>
signal 18 to 11057.4294967295<br>
2019-06-06T09:51:56.608028+00:00 r1n2 slurmstepd[84954]: Sent <br>
signal 15 to 11057.4294967295<br>
2019-06-06T09:51:56.609259+00:00 r1n2 slurmstepd[84954]: <br>
_oom_event_monitor: oom-kill event count: 1<br>
2019-06-06T09:51:56.638334+00:00 r1n2 slurmstepd[84954]: done with job<br>
<br>
Thank you very much again.<br>
<br>
-- <br>
Andrés Marín Díaz<br>
<br>
Servicio de Infraestructura e Innovación<br>
Universidad Politécnica de Madrid<br>
<br>
Centro de Supercomputación y Visualización de Madrid (CeSViMa)<br>
Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES)<br>
<a href="mailto:amarin@cesvima.upm.es" target="_blank">amarin@cesvima.upm.es</a> | tel 910679676<br>
<br>
<a href="http://www.cesvima.upm.es" target="_blank">www.cesvima.upm.es</a> | <a href="http://www.twitter.com/cesvima" target="_blank">
www.twitter.com/cesvima</a> | <a href="http://www.fb.com/cesvima" target="_blank">
www.fb.com/cesvima</a><br>
<br>
<br>
<o:p></o:p></p>
</blockquote>
</div>
</blockquote>
<pre>-- <o:p></o:p></pre>
<pre> Andrés Marín Díaz <o:p></o:p></pre>
<pre> <o:p></o:p></pre>
<pre> Servicio de Infraestructura e Innovación<o:p></o:p></pre>
<pre> Universidad Politécnica de Madrid<o:p></o:p></pre>
<pre> <o:p></o:p></pre>
<pre> Centro de Supercomputación y Visualización de Madrid (CeSViMa) <o:p></o:p></pre>
<pre> Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES) <o:p></o:p></pre>
<pre> <a href="mailto:amarin@cesvima.upm.es" target="_blank">amarin@cesvima.upm.es</a> | tel 910679676 <o:p></o:p></pre>
<pre> <o:p></o:p></pre>
<pre> <a href="http://www.cesvima.upm.es" target="_blank">www.cesvima.upm.es</a> | <a href="http://www.twitter.com/cesvima" target="_blank">www.twitter.com/cesvima</a> | <a href="http://www.fb.com/cesvima" target="_blank">www.fb.com/cesvima</a> <o:p></o:p></pre>
<pre> <o:p></o:p></pre>
</div>
</blockquote>
</div>
</blockquote>
<pre>-- <o:p></o:p></pre>
<pre> Andrés Marín Díaz <o:p></o:p></pre>
<pre> <o:p></o:p></pre>
<pre> Servicio de Infraestructura e Innovación<o:p></o:p></pre>
<pre> Universidad Politécnica de Madrid<o:p></o:p></pre>
<pre> <o:p></o:p></pre>
<pre> Centro de Supercomputación y Visualización de Madrid (CeSViMa) <o:p></o:p></pre>
<pre> Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES) <o:p></o:p></pre>
<pre> <a href="mailto:amarin@cesvima.upm.es">amarin@cesvima.upm.es</a> | tel 910679676 <o:p></o:p></pre>
<pre> <o:p></o:p></pre>
<pre> <a href="http://www.cesvima.upm.es">www.cesvima.upm.es</a> | <a href="http://www.twitter.com/cesvima">www.twitter.com/cesvima</a> | <a href="http://www.fb.com/cesvima">www.fb.com/cesvima</a> <o:p></o:p></pre>
<pre> <o:p></o:p></pre>
</div>
</body>
</html>