<div dir="ltr"><div dir="ltr"><div class="gmail_default" style="font-family:courier new,monospace">Hi Jeremy,</div><div class="gmail_default" style="font-family:courier new,monospace"><br></div><div class="gmail_default" style="font-family:courier new,monospace">What is the value of TreeWidth in your slurm.conf? If there is no entry then I recommend setting it to a value a bit larger than the number of nodes you have in your cluster and then restarting slurmctld. </div><div class="gmail_default" style="font-family:courier new,monospace"><br></div><div class="gmail_default" style="font-family:courier new,monospace">Best,</div><div class="gmail_default" style="font-family:courier new,monospace"><br></div><div class="gmail_default" style="font-family:courier new,monospace">Steve</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Feb 2, 2022 at 12:59 AM Jeremy Fix <<a href="mailto:Jeremy.Fix@centralesupelec.fr">Jeremy.Fix@centralesupelec.fr</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>

<br>

A follow-up. I thought some of the nodes were OK, but that's not the case. <br>

This morning, another pool of consecutive (why consecutive, by the way? <br>

they always fail consecutively) compute nodes is idle*. And now <br>

some of the nodes which were drained came back to life as idle and have now again <br>

switched to idle*.<br>

<br>

One thing I should mention is that the master is now handling a total of <br>

148 nodes. It is the new pool of 100 nodes that has a cycling state. <br>

The previous 48 nodes that were already handled by this master are OK.<br>

<br>

I do not know if this should be considered a large system, but we tried <br>

looking at settings such as the ARP cache [1] on the Slurm <br>

master. I'm not very familiar with that; it seems to me it enlarges the <br>

cache of the node names/IPs table. This morning, the master has 125 <br>

lines in "arp -a" (before changing the settings in sysctl, it was <br>

about 20 or so). Do you think this setting is also necessary on the <br>

compute nodes?<br>

<br>

Best,<br>

<br>

Jeremy.<br>

<br>

<br>

[1] <br>

<a href="https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks" rel="noreferrer" target="_blank">https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks</a><br>


</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><span style="font-family:'courier new',monospace">________________________________________________________________</span><br style="font-family:'courier new',monospace"><span style="font-family:'courier new',monospace"> Steve Cousins             Supercomputer Engineer/Administrator</span><br style="font-family:'courier new',monospace"><span style="font-family:'courier new',monospace"> Advanced Computing Group            University of Maine System</span><br style="font-family:'courier new',monospace"><span style="font-family:'courier new',monospace"> 244 Neville Hall (UMS Data Center)              (207) 581-3574</span><br style="font-family:'courier new',monospace"><span style="font-family:'courier new',monospace"> Orono ME 04469                      steve.cousins at <a href="http://maine.edu" target="_blank">maine.edu</a></span><br><br></div></div></div></div>