<div dir="ltr">Hi,<div>Did you recently add nodes? We have seen that when we add nodes beyond the TreeWidth count, the most recently added nodes lose communication (an asterisk appears next to the node name in sinfo). We have to ensure that the TreeWidth setting in slurm.conf matches or exceeds the number of nodes.</div><div><br></div><div>Doug</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 21, 2022 at 4:33 AM Durai Arasan <<a href="mailto:arasan.durai@gmail.com">arasan.durai@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hello Mike,<div><br></div><div>I am able to ping the nodes from the Slurm master without any problem. There is also nothing interesting in slurmctld.log or slurmd.log. You can trust me on this. That is why I posted here.</div><div><br></div><div>Best,</div><div>Durai Arasan<br>MPI Tuebingen<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jan 20, 2022 at 5:08 PM Michael Robbert <<a href="mailto:mrobbert@mines.edu" target="_blank">mrobbert@mines.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div lang="EN-US">
<div>
<p class="MsoNormal">This looks like some kind of network problem, though it could also be DNS. Can you ping the host involved and resolve its name through DNS?</p>
<p class="MsoNormal">What does slurmctld.log say? How about slurmd.log on the node in question?</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">Mike</p>
<p class="MsoNormal"> </p>
<div style="border-right:none;border-bottom:none;border-left:none;border-top:1pt solid rgb(181,196,223);padding:3pt 0in 0in">
<p class="MsoNormal" style="margin-bottom:12pt"><b><span style="font-size:12pt;color:black">From:
</span></b><span style="font-size:12pt;color:black">slurm-users <<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>> on behalf of Durai Arasan <<a href="mailto:arasan.durai@gmail.com" target="_blank">arasan.durai@gmail.com</a>><br>
<b>Date: </b>Thursday, January 20, 2022 at 08:08<br>
<b>To: </b>Slurm User Community List <<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>><br>
<b>Subject: </b>[External] Re: [slurm-users] srun : Communication connection failure</span></p>
</div>
<div>
<div>
<p class="MsoNormal">Hello Slurm users,</p>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">I forgot to mention that an identical interactive job runs successfully on the GPU partitions of the same cluster, so this is really puzzling.</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">Best,</p>
</div>
<div>
<div>
<p class="MsoNormal"><span style="color:black">Durai Arasan</span></p>
</div>
<div>
<p class="MsoNormal"><span style="color:black">MPI Tuebingen</span></p>
</div>
</div>
</div>
<p class="MsoNormal"> </p>
<div>
<div>
<p class="MsoNormal">On Thu, Jan 20, 2022 at 3:40 PM Durai Arasan <<a href="mailto:arasan.durai@gmail.com" target="_blank">arasan.durai@gmail.com</a>> wrote:</p>
</div>
<blockquote style="border-top:none;border-right:none;border-bottom:none;border-left:1pt solid rgb(204,204,204);padding:0in 0in 0in 6pt;margin-left:4.8pt;margin-right:0in">
<div>
<div>
<p class="MsoNormal">Hello Slurm users,</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">We are suddenly encountering strange errors when trying to launch interactive jobs on our CPU partitions. Have you encountered this problem before? Kindly let us know.</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<p class="MsoNormal">[darasan84@bg-slurmb-login1 ~]$ srun --job-name "admin_test231" --ntasks=1 --nodes=1 --cpus-per-task=1 --partition=cpu-short --mem=1G --nodelist=slurm-cpu-hm-7 --time 1:00:00 --pty bash<br>
srun: error: Task launch for StepId=1137134.0 failed on node slurm-cpu-hm-7: Communication connection failure<br>
srun: error: Application launch failed: Communication connection failure<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.<br>
srun: error: Timed out waiting for job step to complete</p>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">Best regards,</p>
</div>
<div>
<p class="MsoNormal">Durai Arasan</p>
</div>
<div>
<p class="MsoNormal">MPI Tuebingen</p>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</blockquote></div>
</blockquote></div>
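<div dir="ltr">The check Doug describes at the top of the thread (compare the TreeWidth value in slurm.conf with the number of configured nodes) could be sketched roughly as below. This is only an illustration under several assumptions: the helper names <code>expand_hostlist</code> and <code>tree_width_report</code> are hypothetical, the sample configuration is made up and not taken from the cluster in this thread, only simple host expressions with a single bracket group are expanded, and included files are ignored. Whether TreeWidth really has to reach the node count on a given Slurm version is Doug's observation, not something the script verifies.</div>

```python
import re


def expand_hostlist(expr):
    """Expand a simple host expression such as 'node[1-3,7]' into a list of names.

    Simplification: at most one bracket group per name is supported.
    """
    names = []
    # Split on commas that are not inside a [...] group.
    for part in re.split(r",(?![^\[]*\])", expr):
        m = re.fullmatch(r"([^\[\]]*)\[([^\]]+)\]([^\[\]]*)", part)
        if not m:
            names.append(part)
            continue
        prefix, ranges, suffix = m.groups()
        for r in ranges.split(","):
            lo, _, hi = r.partition("-")
            hi = hi or lo
            width = len(lo)  # keep zero padding, e.g. 01, 02
            for i in range(int(lo), int(hi) + 1):
                names.append(f"{prefix}{i:0{width}d}{suffix}")
    return names


def tree_width_report(conf_text):
    """Return (node_count, tree_width); tree_width is None if not set in the text."""
    tree_width = None
    nodes = set()
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        tw = re.match(r"TreeWidth\s*=\s*(\d+)", line, re.IGNORECASE)
        if tw:
            tree_width = int(tw.group(1))
        nn = re.match(r"NodeName\s*=\s*(\S+)", line, re.IGNORECASE)
        if nn and nn.group(1).upper() != "DEFAULT":
            nodes.update(expand_hostlist(nn.group(1)))
    return len(nodes), tree_width


# Hypothetical configuration fragment used only to exercise the helpers.
SAMPLE_CONF = """
# Hypothetical fragment, not taken from the cluster in this thread
TreeWidth=4
NodeName=DEFAULT CPUs=16
NodeName=slurm-cpu-hm-[1-7] RealMemory=64000
NodeName=slurm-gpu-[01-02],slurm-gpu-10 Gres=gpu:4
"""

count, width = tree_width_report(SAMPLE_CONF)
if width is not None and width < count:
    print(f"TreeWidth={width} is below the configured node count ({count})")
```

<div dir="ltr">On a real cluster the text would come from the actual slurm.conf rather than <code>SAMPLE_CONF</code>. If TreeWidth is not set at all, the function returns <code>None</code> for it, and the default for the installed Slurm version applies.</div>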