<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        font-size:12.0pt;
        font-family:"Calibri",sans-serif;
        mso-ligatures:standardcontextual;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
        {mso-style-priority:34;
        margin-top:0in;
        margin-right:0in;
        margin-bottom:0in;
        margin-left:.5in;
        font-size:12.0pt;
        font-family:"Calibri",sans-serif;
        mso-ligatures:standardcontextual;}
span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Calibri",sans-serif;
        color:windowtext;}
span.apple-converted-space
        {mso-style-name:apple-converted-space;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:12.0pt;
        font-family:"Calibri",sans-serif;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
/* List Definitions */
@list l0
        {mso-list-id:865027409;
        mso-list-type:hybrid;
        mso-list-template-ids:-462935010 67698705 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l0:level1
        {mso-level-text:"%1\)";
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level2
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level3
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l0:level4
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level5
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level6
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l0:level7
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level8
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level9
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
ol
        {margin-bottom:0in;}
ul
        {margin-bottom:0in;}
--></style>
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">I have a user who is submitting a job to Slurm that requests 16 tasks, i.e.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:black;mso-ligatures:none">#SBATCH --ntasks 16<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:black;mso-ligatures:none">#SBATCH --cpus-per-task 1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">The Slurm script runs an MPI program called Parent.mpi, which then tries (and fails) to launch 15 MPI child processes. He’s tried two different ways for the parent to spawn the children:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<ol style="margin-top:0in" start="1" type="1">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1"><span style="color:black">A system() call, such as system("srun --ntasks=4 mpirun -np 4 ./child.mpi") or system("mpirun -np 4 ./child.mpi")</span><span style="font-size:11.0pt"><o:p></o:p></span></li></ol>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<ol style="margin-top:0in" start="2" type="1">
<li class="MsoListParagraph" style="margin-left:0in;mso-list:l0 level1 lfo1"><span style="color:black">A call to MPI_Comm_spawn</span><span style="font-size:11.0pt"><o:p></o:p></span></li></ol>
<p class="MsoListParagraph"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Both approaches produce the following in the Slurm output file:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:black;mso-ligatures:none">srun: Job ### step creation temporarily disabled, retrying (Requested nodes are busy)<br>
srun: error: Unable to create step for job ###: Job/step already completing or completed</span><span style="font-size:11.0pt;mso-ligatures:none"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">So, basically, he’s requesting 16 tasks: one is used by the parent and the other 15 are supposed to be used by the children, but the children can’t use those 15 because... well, I’m not sure why.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Is there something I need to change in slurm.conf to allow this to work?<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-family:'Times New Roman',serif;mso-ligatures:none">---<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:'Times New Roman',serif;mso-ligatures:none">Mike VanHorn<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:'Times New Roman',serif;mso-ligatures:none">Senior Computer Systems Administrator<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:'Times New Roman',serif;mso-ligatures:none">College of Engineering and Computer Science<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:'Times New Roman',serif;mso-ligatures:none">Wright State University<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:'Times New Roman',serif;mso-ligatures:none">265 Russ Engineering Center<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:'Times New Roman',serif;mso-ligatures:none">937-775-5157<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:'Times New Roman',serif;mso-ligatures:none">michael.vanhorn@wright.edu</span><o:p></o:p></p>
</div>
</body>
</html>