<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>That 'not responding' is the issue, and it usually means one of
      two things:</p>
    <p>1) slurmd is not running on the node<br>
      2) something on the network is blocking communication between the
      node and the controller (firewall, SELinux, congestion, a bad
      NIC, bad routes, etc.)</p>
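    <p>A quick way to rule both out from a node that still exists
      (assuming systemd, and the ControlMachine=slurm9 and default
      SlurmctldPort 6817 from your slurm.conf):</p>
    <p><font face="monospace"># is slurmd alive on the node?<br>
        systemctl status slurmd<br>
        <br>
        # can the node reach slurmctld? scontrol ping asks the<br>
        # controller(s) to respond<br>
        scontrol ping<br>
        <br>
        # raw TCP check against the controller host<br>
        nc -zv slurm9 6817</font></p>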
    <p>Brian Andrus<br>
    </p>
    <div class="moz-cite-prefix">On 7/30/2021 3:51 PM, Soichi Hayashi
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAGLeeFiw9qegBTdCTcG3mE9UmJ8-Hh7Z5oUj_8N2x84QE3eirA@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">
        <div>Brian,</div>
        <div><br>
        </div>
        <div>Thank you for your reply, and thanks for fixing the email
          subject. I forgot to edit it before I sent it!</div>
        <div><br>
        </div>
        <div>I am not sure how to reply to your reply, but I hope this
          makes it to the right place.</div>
        <div><br>
        </div>
        <div>I've updated slurm.conf to increase the controller debug
          level</div>
        <div>> SlurmctldDebug=5</div>
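        <div>(If I read the scontrol man page correctly, the same
          change can also be made at runtime, without restarting
          slurmctld:)</div>
        <div><font face="monospace">> scontrol setdebug debug</font></div>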
        <div><br>
        </div>
        <div>I now see additional log output (debug).</div>
        <div><br>
        </div>
        <div><font face="monospace">[2021-07-30T22:42:05.255] debug:
             Spawning ping agent for slurm4-compute[2-6,10,12-14]<br>
            [2021-07-30T22:42:05.256] error: Nodes
            slurm4-compute[9,15,19-22,30] not responding, setting DOWN</font><br>
        </div>
        <div><br>
        </div>
        <div>It's still very sparse, but it looks like Slurm is trying
          to ping nodes that have already been removed (they no longer
          exist, since the slurm_suspend.sh script deletes them).</div>
        <div><br>
        </div>
        <div>I tried sinfo -R, but it doesn't really give much info:</div>
        <div><br>
        </div>
        <div><font face="monospace">$ sinfo -R<br>
            REASON               USER      TIMESTAMP           NODELIST<br>
            Not responding       slurm     2021-07-30T22:42:05
            slurm4-compute[9,15,19-22,30]</font><br>
        </div>
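        <div>(For more per-node detail, including the NodeAddr the
          controller is actually trying to reach, there is also:)</div>
        <div><font face="monospace">$ scontrol show node slurm4-compute9</font></div>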
        <div><br>
        </div>
        <div>These machines are gone, so they should not respond. </div>
        <div><br>
        </div>
        <div><font face="monospace">$ ping slurm4-compute9<br>
            ping: slurm4-compute9: Name or service not known</font><br>
        </div>
        <div><br>
        </div>
        <div>This is expected.</div>
        <div><br>
        </div>
        <div>Why does Slurm keep trying to contact nodes that have
          already been removed? slurm_suspend.sh does the following to
          "remove" a node from the partition:</div>
        <div><font face="monospace">> scontrol update
            nodename=${host} nodeaddr="(null)"</font></div>
        <div>Maybe this isn't the correct way to do it? Is there a way
          to force Slurm to forget about the node? I tried "scontrol
          update node=$node state=idle", but that only works for a few
          minutes, until Slurm's ping agent kicks in and marks them
          down again.</div>
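        <div>(Based on my reading of the elastic computing docs, I
          would have expected the suspend script to point NodeAddr back
          at the node's own name rather than at "(null)". A minimal
          sketch of that pattern; the instance-teardown step is a
          placeholder:)</div>
        <div><font face="monospace">#!/bin/bash<br>
            # slurm_suspend.sh: slurmctld passes the node list as $1,<br>
            # e.g. "slurm4-compute[9,15]"<br>
            for host in $(scontrol show hostnames "$1"); do<br>
                # ... tear down the cloud instance here ...<br>
                # reset the address so the controller stops pinging a<br>
                # stale IP<br>
                scontrol update nodename="${host}" nodeaddr="${host}"<br>
            done</font></div>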
        <div><br>
        </div>
        <div>Thanks!!</div>
        <div>Soichi </div>
        <div><br>
        </div>
        <div class="gmail_quote">
          <div dir="ltr" class="gmail_attr">On Fri, Jul 30, 2021 at 2:21
            PM Soichi Hayashi <<a href="mailto:hayashis@iu.edu"
              moz-do-not-send="true">hayashis@iu.edu</a>> wrote:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px
            0.8ex;border-left:1px solid
            rgb(204,204,204);padding-left:1ex">
            <div dir="ltr">Hello. I need a help with troubleshooting our
              slurm cluster. 
              <div><br>
              </div>
              <div>I am running slurm-wlm 17.11.2 on Ubuntu 20 on a
                public cloud infrastructure (Jetstream) using an elastic
                computing mechanism (<a
                  href="https://slurm.schedmd.com/elastic_computing.html"
                  target="_blank" moz-do-not-send="true">https://slurm.schedmd.com/elastic_computing.html</a>).
                Our cluster works for the most part, but for some
                reason a few of our nodes constantly go into the "down"
                state.
                <div><br>
                </div>
                <div><font face="monospace">PARTITION AVAIL  TIMELIMIT  
                    JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE
                    NODELIST<br>
                    cloud*       up 2-00:00:00 1-infinite   no    YES:4
                           all     10       idle~
                    slurm9-compute[1-5,10,12-15]<br>
                    cloud*       up 2-00:00:00 1-infinite   no    YES:4
                           all      5        down slurm9-compute[6-9,11]</font><br>
                  <div><br>
                  </div>
                  <div>The only thing I see in the slurm log is this:</div>
                  <div><br>
                  </div>
                  <div><font face="monospace">[2021-07-30T15:10:55.889]
                      Invalid node state transition requested for node
                      slurm9-compute6 from=COMPLETING to=RESUME<br>
                      [2021-07-30T15:21:37.339] Invalid node state
                      transition requested for node slurm9-compute6
                      from=COMPLETING* to=RESUME<br>
                      [2021-07-30T15:27:30.039] update_node: node
                      slurm9-compute6 reason set to: completing<br>
                      [2021-07-30T15:27:30.040] update_node: node
                      slurm9-compute6 state set to DOWN<br>
                      [2021-07-30T15:27:40.830] update_node: node
                      slurm9-compute6 state set to IDLE</font><br>
                  </div>
                  <div>..</div>
                  <div><font face="monospace">[2021-07-30T15:34:20.628] error: Nodes
                    slurm9-compute[6-9,11] not responding, setting DOWN</font><br>
                  </div>
                </div>
                <div><br>
                </div>
                <div>With elastic computing, any unused nodes are
                  automatically removed
                  (by SuspendProgram=/usr/local/sbin/slurm_suspend.sh).
                  So nodes are *expected* to stop responding once they
                  are removed, but they should not be marked as DOWN;
                  they should simply go back to "idle". </div>
              </div>
              <div><br>
              </div>
              <div>To work around this issue, I am running the following
                cron job.</div>
              <div><br>
              </div>
              <div><font face="monospace">0 0 * * * scontrol update
                  node=slurm9-compute[1-30] state=resume</font><br>
              </div>
              <div><br>
              </div>
              <div>This "works" somewhat.. but our nodes go to "DOWN"
                state so often that running this every hour is not
                enough.</div>
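              <div>(I could run a tighter variant more often, resuming
                only the nodes currently marked down; an untested
                sketch:)</div>
              <div><font face="monospace">*/10 * * * * sinfo -h -N -t down -o "%N" | sort -u | xargs -r -I{} scontrol update nodename={} state=resume</font><br>
              </div>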
              <div><br>
              </div>
              <div>Here is the full content of our slurm.conf</div>
              <div><br>
              </div>
              <div><font face="monospace">root@slurm9:~# cat
                  /etc/slurm-llnl/slurm.conf <br>
                  ClusterName=slurm9<br>
                  ControlMachine=slurm9<br>
                  <br>
                  SlurmUser=slurm<br>
                  SlurmdUser=root<br>
                  SlurmctldPort=6817<br>
                  SlurmdPort=6818<br>
                  AuthType=auth/munge<br>
                  StateSaveLocation=/tmp<br>
                  SlurmdSpoolDir=/tmp/slurmd<br>
                  SwitchType=switch/none<br>
                  MpiDefault=none<br>
                  SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid<br>
                  SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid<br>
                  ProctrackType=proctrack/pgid<br>
                  ReturnToService=1<br>
                  Prolog=/usr/local/sbin/slurm_prolog.sh<br>
                  <br>
                  #<br>
                  # TIMERS<br>
                  SlurmctldTimeout=300<br>
                  SlurmdTimeout=300<br>
                  #make slurm a little more tolerant here<br>
                  MessageTimeout=30<br>
                  TCPTimeout=15<br>
                  BatchStartTimeout=20<br>
                  GetEnvTimeout=20<br>
                  InactiveLimit=0<br>
                  MinJobAge=604800<br>
                  KillWait=30<br>
                  Waittime=0<br>
                  #<br>
                  # SCHEDULING<br>
                  SchedulerType=sched/backfill<br>
                  SelectType=select/cons_res<br>
                  SelectTypeParameters=CR_CPU_Memory<br>
                  #FastSchedule=0<br>
                  <br>
                  # LOGGING<br>
                  SlurmctldDebug=3<br>
                  SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log<br>
                  SlurmdDebug=3<br>
                  SlurmdLogFile=/var/log/slurm-llnl/slurmd.log<br>
                  JobCompType=jobcomp/none<br>
                  <br>
                  # ACCOUNTING<br>
                  JobAcctGatherType=jobacct_gather/linux<br>
                  JobAcctGatherFrequency=30<br>
                  <br>
                  AccountingStorageType=accounting_storage/filetxt<br>
                  AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log<br>
                  <br>
                  #CLOUD CONFIGURATION<br>
                  PrivateData=cloud<br>
                  ResumeProgram=/usr/local/sbin/slurm_resume.sh<br>
                  SuspendProgram=/usr/local/sbin/slurm_suspend.sh<br>
                  ResumeRate=1 #number of nodes per minute that can be
                  created; 0 means no limit<br>
                  ResumeTimeout=900 #max time in seconds between
                  ResumeProgram running and when the node is ready for
                  use<br>
                  SuspendRate=1 #number of nodes per minute that can be
                  suspended/destroyed<br>
                  SuspendTime=600 #time in seconds before an idle node
                  is suspended<br>
                  SuspendTimeout=300 #time between running
                  SuspendProgram and the node being completely down<br>
                  TreeWidth=30<br>
                  <br>
                  NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24
                  RealMemory=60388<br>
                  PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15]
                  Default=YES MaxTime=48:00:00 State=UP Shared=YES</font><br>
              </div>
              <div><font face="monospace"><br>
                </font></div>
              <div><font face="arial, sans-serif">I appreciate your
                  assistance!</font></div>
              <div><font face="arial, sans-serif"><br>
                </font></div>
              <div><font face="arial, sans-serif">Soichi Hayashi</font></div>
              <div><font face="arial, sans-serif">Indiana University</font></div>
              <div><font face="arial, sans-serif"><br>
                </font></div>
              <div><font face="monospace"><br>
                </font></div>
            </div>
          </blockquote>
        </div>
      </div>
    </blockquote>
  </body>
</html>