<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri",sans-serif;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal">Hi,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">My MPICH jobs launch and the desired number of processes is created, but when one of those processes tries to spawn a new process using MPI_Comm_spawn(), that process just spins in the polling code deep within the MPICH library.
See the Slurm <span style="color:red">error message</span> below. This all works without problems on other clusters that have Torque as the process manager. We are using Slurm 20.02.3 on Red Hat (kernel 4.18.0) and MPICH 4.0b1.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">salloc: defined options<o:p></o:p></p>
<p class="MsoNormal">salloc: -------------------- --------------------<o:p></o:p></p>
<p class="MsoNormal">salloc: cpus-per-task : 24<o:p></o:p></p>
<p class="MsoNormal">salloc: ntasks : 2<o:p></o:p></p>
<p class="MsoNormal">salloc: verbose : 1<o:p></o:p></p>
<p class="MsoNormal">salloc: -------------------- --------------------<o:p></o:p></p>
<p class="MsoNormal">salloc: end of defined options<o:p></o:p></p>
<p class="MsoNormal">salloc: Linear node selection plugin loaded with argument 4<o:p></o:p></p>
<p class="MsoNormal">salloc: select/cons_res loaded with argument 4<o:p></o:p></p>
<p class="MsoNormal">salloc: Cray/Aries node selection plugin loaded<o:p></o:p></p>
<p class="MsoNormal">salloc: select/cons_tres loaded with argument 4<o:p></o:p></p>
<p class="MsoNormal">salloc: Granted job allocation 34330<o:p></o:p></p>
<p class="MsoNormal"><span style="color:red">srun: error: Unable to create step for job 34330: Requested node configuration is not available</span><o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I’m wondering if the salloc command I am using is correct. I intend it to launch two processes, one per node, while reserving all 24 cores on each node so that each launched process can spawn new processes with MPI_Comm_spawn(). Could reserving
all 24 cores make Slurm or MPICH think that no cores are left for the spawned processes?<o:p></o:p></p>
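<p class="MsoNormal">To make the question concrete, here is a rough sketch (in Python, purely for illustration) of the CPU accounting I suspect might be happening. It assumes, and this is only my guess, that Slurm assigns all cpus-per-task CPUs to each launched task and that a spawned process needs a CPU no existing task already holds. The function name free_cpus_per_node is made up for this example; the numbers come from the salloc options and from the CPUTot=24 value in the scontrol output further down.<o:p></o:p></p>
<pre>
```python
# Back-of-the-envelope CPU accounting for the allocation described above.
# ASSUMPTION (not confirmed): Slurm assigns cpus-per-task CPUs to each
# launched task, and a spawned process needs a CPU no task already holds.

def free_cpus_per_node(cpus_per_node, tasks_per_node, cpus_per_task):
    """Return the CPUs on one node not already assigned to a launched task."""
    return cpus_per_node - tasks_per_node * cpus_per_task

# Values taken from the salloc options and the scontrol output (CPUTot=24).
CPUS_PER_NODE = 24
NTASKS = 2
CPUS_PER_TASK = 24
NODES = 2  # one task per node, as intended

tasks_per_node = NTASKS // NODES
remaining = free_cpus_per_node(CPUS_PER_NODE, tasks_per_node, CPUS_PER_TASK)
print(f"CPUs left per node for spawned processes: {remaining}")
```
</pre>
<p class="MsoNormal">If that assumption holds, nothing is left on either node for a spawned process, which would at least be consistent with the srun error above. I have not confirmed that this is the actual cause.<o:p></o:p></p>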
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><b>salloc --ntasks=2 --cpus-per-task=24 --verbose runscript.bash …<o:p></o:p></b></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I think that our cluster’s compute nodes are configured correctly:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><b>$ scontrol show node=n001<o:p></o:p></b></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">NodeName=n001 Arch=x86_64 CoresPerSocket=6 <o:p></o:p></p>
<p class="MsoNormal"> CPUAlloc=0 CPUTot=24 CPULoad=0.00<o:p></o:p></p>
<p class="MsoNormal"> AvailableFeatures=(null)<o:p></o:p></p>
<p class="MsoNormal"> ActiveFeatures=(null)<o:p></o:p></p>
<p class="MsoNormal"> Gres=(null)<o:p></o:p></p>
<p class="MsoNormal"> NodeAddr=n001 NodeHostName=n001 Version=20.02.3<o:p></o:p></p>
<p class="MsoNormal"> OS=Linux 4.18.0-348.el8.x86_64 #1 SMP Mon Oct 4 12:17:22 EDT 2021
<o:p></o:p></p>
<p class="MsoNormal"> RealMemory=128351 AllocMem=0 FreeMem=126160 Sockets=4 Boards=1<o:p></o:p></p>
<p class="MsoNormal"> State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A<o:p></o:p></p>
<p class="MsoNormal"> Partitions=normal,low,high <o:p></o:p></p>
<p class="MsoNormal"> BootTime=2021-12-21T14:25:05 SlurmdStartTime=2021-12-21T14:25:52<o:p></o:p></p>
<p class="MsoNormal"> CfgTRES=cpu=24,mem=128351M,billing=24<o:p></o:p></p>
<p class="MsoNormal"> AllocTRES=<o:p></o:p></p>
<p class="MsoNormal"> CapWatts=n/a<o:p></o:p></p>
<p class="MsoNormal"> CurrentWatts=0 AveWatts=0<o:p></o:p></p>
<p class="MsoNormal"> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Thanks for any help.<o:p></o:p></p>
</div>
</body>
</html>