<div dir="ltr">We have a <div> CentOS 8.5 cluster </div><div> slurm 20.11</div><div> Mellanox ConnectX 6 HDR IB and Mellanox 32 port switch</div><div><br></div><div>Our application is not scaling. I discovered the process communications are going over ethernet, not ib. I used the ifconfig count for the eno2 (ethernet) and ib0 (infiniband) interfaces at end of a job, and subtracted the count at the beginning. We are using sbatch and</div><div>srun {application}</div><div><br></div><div>If I interactively login to a node and use the command</div><div>mpiexec -iface ib0 -n 32 -machinefile machinefile {application}</div><div><br></div><div>where machinefile contains 32 lines with the ib hostname:</div><div>ne08-ib</div><div>ne08-ib</div><div>...</div><div>ne09-ib</div><div>ne09-ib</div><div><br></div><div>the application runs over ib and scales. </div><div><br></div><div>/etc/slurm/slurm.conf uses the ethernet interface for administrative communications and allocation:</div><div><br></div><div>
<p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures">NodeName=ne[01-09] CPUs=32 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN</span></p><p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures"><br></span></p><p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures">
</span></p><p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures">PartitionName=neon-noSMT Nodes=ne[01-09] Default=NO MaxTime=3-00:00:00 DefaultTime=4:00:00 State=UP OverSubscribe=YES</span></p><p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures"><br></span></p><p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures">I've read this is the recommended configuration.</span></p></div><div><br></div><div>I looked for srun parameters that would instruct srun to run over the ib interface when the job is run through the slurm queue. </div><div><br></div><div>I found the --network parameter:</div><div>
<p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures">srun --network=DEVNAME=mlx5_ib,DEVTYPE=IB</span></p><p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures"><br></span></p><p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures">but there is not much documentation on this and I haven't been able to run a job yet.</span></p><p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures"><br></span></p><p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures">Is this the way we should be directing srun to run the executable over infiniband?</span></p><p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures"><br></span></p><p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures">Thanks in advance,</span></p><p class="gmail-p1" style="margin:0px;font-variant-numeric:normal;font-variant-east-asian:normal;font-stretch:normal;font-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span class="gmail-s1" style="font-variant-ligatures:no-common-ligatures">Anne Hammond</span></p></div><div><br></div><div><br></div></div>