<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <div class="moz-cite-prefix">Hello, thank you for your suggestion,
      and thanks also to Tina.</div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">To answer your question, there is no
      TreeWidth entry in the slurm.conf</div>
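    <div class="moz-cite-prefix">If we do add one, my understanding of
      the suggestion is that the entry would look roughly like this
      (the value is illustrative, picked a bit above our 148 nodes, and
      slurmctld would need a restart afterwards):</div>

```
# Illustrative slurm.conf fragment: TreeWidth set a bit larger than the
# 148 nodes in the cluster, per the suggestion quoted below.
TreeWidth=150
```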
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">But it seems we figured out the issue,
      and I'm sorry we did not think of it earlier: we already had a
      pool of 48 nodes on the master, but their slurm.conf had diverged
      from the one on the pool of cycling-state nodes. At the very
      least, their slurmd was not restarted. <br>
    </div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">And indeed, several people suggested
      that the slurmd daemons need to talk to each other. That is really
      our fault: 100 nodes were aware of all 148 nodes, while the other
      48 were only aware of themselves. I suppose that caused problems
      for the master.</div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">So even though we also had other
      issues, such as interfaces flip-flopping, the diverged slurm.conf
      was probably the root cause. <br>
    </div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">Thank you all for your help. It is time
      to compute :)<br>
    </div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">Jeremy.<br>
    </div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">On 02/02/2022 16:27, Stephen Cousins
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAMFqqRqprvk_csLqjJDMmtOAcya+3kTyauRViPno2tH6c+tObg@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">
        <div dir="ltr">
          <div class="gmail_default" style="font-family:courier
            new,monospace">Hi Jeremy,</div>
          <div class="gmail_default" style="font-family:courier
            new,monospace"><br>
          </div>
          <div class="gmail_default" style="font-family:courier
            new,monospace">What is the value of TreeWidth in your
            slurm.conf? If there is no entry then I recommend setting it
            to a value a bit larger than the number of nodes you have in
            your cluster and then restarting slurmctld. </div>
          <div class="gmail_default" style="font-family:courier
            new,monospace"><br>
          </div>
          <div class="gmail_default" style="font-family:courier
            new,monospace">Best,</div>
          <div class="gmail_default" style="font-family:courier
            new,monospace"><br>
          </div>
          <div class="gmail_default" style="font-family:courier
            new,monospace">Steve</div>
        </div>
        <br>
        <div class="gmail_quote">
          <div dir="ltr" class="gmail_attr">On Wed, Feb 2, 2022 at 12:59
            AM Jeremy Fix <<a
              href="mailto:Jeremy.Fix@centralesupelec.fr"
              moz-do-not-send="true" class="moz-txt-link-freetext">Jeremy.Fix@centralesupelec.fr</a>>
            wrote:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px
            0.8ex;border-left:1px solid
            rgb(204,204,204);padding-left:1ex">Hi,<br>
            <br>
            A follow-up. I thought some of the nodes were OK, but that's
            not the case. <br>
            This morning, another pool of consecutive compute nodes (why
            consecutive, <br>
            by the way? they always fail consecutively) is idle*. And
            now some <br>
            of the nodes which were drained came back to life in idle
            and have now <br>
            switched back to idle*.<br>
            <br>
            One thing I should mention is that the master is now
            handling a total of <br>
            148 nodes. The new pool of 100 nodes is the one with the
            cycling state; <br>
            the previous 48 nodes that were already handled by this
            master are OK.<br>
            <br>
            I do not know if this should be considered a large system,
            but we tried <br>
            to have a look at settings such as the ARP cache [1] on the
            Slurm <br>
            master. I'm not very familiar with that; it seems to me it
            enlarges the <br>
            kernel's cache of the node name/IP table. This morning, the
            master has 125 <br>
            lines in "arp -a" (before changing the settings in sysctl,
            it was <br>
            more like 20 or so). Do you think this setting is also
            necessary on the <br>
            compute nodes?<br>
            <br>
            Best,<br>
            <br>
            Jeremy.<br>
            <br>
            <br>
            [1] <br>
            <a
href="https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks"
              rel="noreferrer" target="_blank" moz-do-not-send="true"
              class="moz-txt-link-freetext">https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks</a><br>
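            <br>
            For readers following [1], the kind of sysctl change it
            describes looks roughly like this (the threshold values are
            illustrative, not a recommendation for any particular
            cluster):<br>

```
# Illustrative /etc/sysctl.d/ fragment raising the kernel's ARP
# (neighbour) cache thresholds so entries for a ~150-node cluster are
# not garbage-collected; apply with "sysctl --system".
net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 16384
net.ipv4.neigh.default.gc_thresh3 = 32768
```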
            <br>
            <br>
            <br>
            <br>
          </blockquote>
        </div>
        <br clear="all">
        <div><br>
        </div>
        -- <br>
        <div dir="ltr" class="gmail_signature">
          <div dir="ltr">
            <div><span style="font-family:"courier
                new",monospace">________________________________________________________________</span><br
                style="font-family:"courier new",monospace">
              <span style="font-family:"courier
                new",monospace"> Steve Cousins            
                Supercomputer Engineer/Administrator</span><br
                style="font-family:"courier new",monospace">
              <span style="font-family:"courier
                new",monospace"> Advanced Computing Group 
                          University of Maine System</span><br
                style="font-family:"courier new",monospace">
              <span style="font-family:"courier
                new",monospace"> 244 Neville Hall (UMS Data
                Center)              (207) 581-3574</span><br
                style="font-family:"courier new",monospace">
              <span style="font-family:"courier
                new",monospace"> Orono ME
                04469                      steve.cousins at <a
                  href="http://maine.edu" target="_blank"
                  moz-do-not-send="true">maine.edu</a></span></div>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
  </body>
</html>