<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>You should definitely upgrade because there have been significant
      improvements in that area.</p>
    <p>You can label nodes as cloud nodes, and simply updating a node's
      state to 'power_down' will run your suspend script.</p>
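    <p>For example (from memory, so double-check the exact names against
      whatever release you upgrade to), newer versions let you keep
      suspended cloud nodes marked idle and trigger the suspend script by
      hand:</p>
    <p><font face="monospace"># slurm.conf (idle_on_node_suspend appeared around 19.05, if I recall)<br>
        SlurmctldParameters=idle_on_node_suspend<br>
        <br>
        # run the configured SuspendProgram for one node by hand<br>
        scontrol update nodename=slurm4-compute9 state=power_down</font></p>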
    <p>Brian Andrus<br>
    </p>
    <div class="moz-cite-prefix">On 7/30/2021 5:05 PM, Soichi Hayashi
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAGLeeFiPnyBuoTqzPNoqTJ7wQ46-+PNHr8ZfxVsCp2-mM8MZ+A@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">Brian,
        <div><br>
        </div>
        <div>Yes, slurmd is not running on that node because the node
          itself is not there anymore (the whole VM is gone!). When the
          node is no longer in use, slurm automatically runs the
          slurm_suspend.sh script, which removes the whole node (VM) by
          running "openstack server delete $host". There is no
          server/VM, no IP address, no DNS name, nothing.
          "slurm4-compute9" only exists as a hypothetical node that can
          be launched in the future in case there are more jobs to run.
          That's how the "cloud" partition works, right?</div>
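        <div><br>
        </div>
        <div>(For reference, the suspend script is roughly shaped like
          the sketch below. I'm paraphrasing from memory rather than
          pasting the exact file, so take it as a sketch.)</div>
        <div><font face="monospace">#!/bin/bash<br>
            # slurm passes a hostlist expression such as "slurm4-compute[9,15]" as $1<br>
            for host in $(scontrol show hostnames "$1"); do<br>
            &nbsp;&nbsp;&nbsp;&nbsp;# delete the VM backing this node<br>
            &nbsp;&nbsp;&nbsp;&nbsp;openstack server delete "$host"<br>
            &nbsp;&nbsp;&nbsp;&nbsp;# clear the address slurm has recorded for it<br>
            &nbsp;&nbsp;&nbsp;&nbsp;scontrol update nodename="$host" nodeaddr="(null)"<br>
            done</font></div>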
        <div><br>
        </div>
        <div><font face="monospace">[slurm.conf]</font></div>
        <div><font face="monospace">SuspendProgram=/usr/local/sbin/slurm_suspend.sh<br>
            SuspendTime=600 #time in seconds before an idle node is
            suspended</font></div>
        <div><br>
        </div>
        <div>I am wondering... maybe something went wrong when slurm ran
          slurm_suspend.sh, so slurm *thinks* the node is still
          there: it tries to ping it, the ping fails (obviously...), and
          it marks the node DOWN?</div>
        <div><br>
        </div>
        <div>I don't know if my theory is right, but just to get
          our cluster going again, is there a way to force slurm to
          forget about a node that it "suspended" earlier? Is there a
          command like "scontrol forcesuspend node=$id"?</div>
        <div><br>
        </div>
        <div>Thank you for your help!<br>
        </div>
        <div><br>
        </div>
        <div>-soichi </div>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Fri, Jul 30, 2021 at 7:56
          PM Brian Andrus <<a href="mailto:toomuchit@gmail.com"
            moz-do-not-send="true">toomuchit@gmail.com</a>> wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div>
            <p>That 'not responding' is the issue and usually means 1 of
              2 things:</p>
            <p>1) slurmd is not running on the node<br>
              2) something on the network is stopping the communication
              between the node and the master (firewall, selinux,
              congestion, bad nic, routes, etc)</p>
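            <p>A couple of quick checks (assuming systemd and the default
              slurmd port; adjust for your setup):</p>
            <p><font face="monospace"># on the compute node: is slurmd running?<br>
                systemctl status slurmd<br>
                # from the controller: can we reach the slurmd port?<br>
                nc -zv slurm4-compute9 6818</font></p>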
            <p>Brian Andrus<br>
            </p>
            <div>On 7/30/2021 3:51 PM, Soichi Hayashi wrote:<br>
            </div>
            <blockquote type="cite">
              <div dir="ltr">
                <div>Brian,</div>
                <div><br>
                </div>
                <div>Thank you for your reply and thanks for setting the
                  email title. I forgot to edit it before I sent it!</div>
                <div><br>
                </div>
                <div>I am not sure how to reply to your reply...
                  but I hope this makes it to the right place.</div>
                <div><br>
                </div>
                <div>I've updated slurm.conf to increase the controller
                  debug level</div>
                <div>> SlurmctldDebug=5</div>
                <div><br>
                </div>
                <div>I now see additional log output (debug).</div>
                <div><br>
                </div>
                <div><font face="monospace">[2021-07-30T22:42:05.255]
                    debug:  Spawning ping agent for
                    slurm4-compute[2-6,10,12-14]<br>
                    [2021-07-30T22:42:05.256] error: Nodes
                    slurm4-compute[9,15,19-22,30] not responding,
                    setting DOWN</font><br>
                </div>
                <div><br>
                </div>
                <div>It's still very sparse, but it looks like slurm is
                  trying to ping nodes that have already been removed
                  (they don't exist anymore, as they were removed by the
                  slurm_suspend.sh script).</div>
                <div><br>
                </div>
                <div>I tried sinfo -R but it doesn't really give much
                  info..</div>
                <div><br>
                </div>
                <div><font face="monospace">$ sinfo -R<br>
                    REASON               USER      TIMESTAMP          
                    NODELIST<br>
                    Not responding       slurm     2021-07-30T22:42:05
                    slurm4-compute[9,15,19-22,30]</font><br>
                </div>
                <div><br>
                </div>
                <div>These machines are gone, so they should not respond.</div>
                <div><br>
                </div>
                <div><font face="monospace">$ ping slurm4-compute9<br>
                    ping: slurm4-compute9: Name or service not known</font><br>
                </div>
                <div><br>
                </div>
                <div>This is expected.</div>
                <div><br>
                </div>
                <div>Why does slurm keep trying to contact nodes
                  that have already been removed? slurm_suspend.sh does the
                  following to "remove" a node from the partition.</div>
                <div><font face="monospace">> scontrol update
                    nodename=${host} nodeaddr="(null)"</font></div>
                <div>Maybe this isn't the correct way to do it? Is there
                  a way to force slurm to forget about the node? I tried
                  "scontrol update node=$node state=idle", but it only
                  works for a few minutes, until slurm's ping agent kicks
                  in and marks the nodes down again.</div>
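                <div><br>
                </div>
                <div>Or should the suspend script instead point the
                  address back at the node name rather than the literal
                  "(null)"? I haven't tried this yet, so it's only a
                  guess:</div>
                <div><font face="monospace">scontrol update nodename=${host} nodeaddr=${host} nodehostname=${host}</font></div>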
                <div><br>
                </div>
                <div>Thanks!!</div>
                <div>Soichi </div>
                <div><br>
                </div>
                <div><br>
                </div>
                <div><br>
                </div>
                <div><br>
                </div>
                <div class="gmail_quote">
                  <div dir="ltr" class="gmail_attr">On Fri, Jul 30, 2021
                    at 2:21 PM Soichi Hayashi <<a
                      href="mailto:hayashis@iu.edu" target="_blank"
                      moz-do-not-send="true">hayashis@iu.edu</a>>
                    wrote:<br>
                  </div>
                  <blockquote class="gmail_quote" style="margin:0px 0px
                    0px 0.8ex;border-left:1px solid
                    rgb(204,204,204);padding-left:1ex">
                    <div dir="ltr">Hello. I need a help with
                      troubleshooting our slurm cluster. 
                      <div><br>
                      </div>
                      <div>I am running slurm-wlm 17.11.2 on Ubuntu 20
                        on a public cloud infrastructure (Jetstream)
                        using an elastic computing mechanism (<a
                          href="https://slurm.schedmd.com/elastic_computing.html"
                          target="_blank" moz-do-not-send="true">https://slurm.schedmd.com/elastic_computing.html</a>).
                        Our cluster works for the most part, but for
                        some reason a few of our nodes constantly go
                        into the "down" state.
                        <div><br>
                        </div>
                        <div><font face="monospace">PARTITION AVAIL
                             TIMELIMIT   JOB_SIZE ROOT OVERSUBS    
                            GROUPS  NODES       STATE NODELIST<br>
                            cloud*       up 2-00:00:00 1-infinite   no  
                             YES:4        all     10       idle~
                            slurm9-compute[1-5,10,12-15]<br>
                            cloud*       up 2-00:00:00 1-infinite   no  
                             YES:4        all      5        down
                            slurm9-compute[6-9,11]</font><br>
                          <div><br>
                          </div>
                          <div>The only log I see in the slurm log is
                            this..</div>
                          <div><br>
                          </div>
                          <div><font face="monospace">[2021-07-30T15:10:55.889]
                              Invalid node state transition requested
                              for node slurm9-compute6 from=COMPLETING
                              to=RESUME<br>
                              [2021-07-30T15:21:37.339] Invalid node
                              state transition requested for node
                              slurm9-compute6 from=COMPLETING* to=RESUME<br>
                              [2021-07-30T15:27:30.039] update_node:
                              node slurm9-compute6 reason set to:
                              completing<br>
                              [2021-07-30T15:27:30.040] update_node:
                              node slurm9-compute6 state set to DOWN<br>
                              [2021-07-30T15:27:40.830] update_node:
                              node slurm9-compute6 state set to IDLE</font><br>
                          </div>
                          <div>..</div>
                          <div><font face="monospace">[2021-07-30T15:34:20.628] error: Nodes slurm9-compute[6-9,11] not responding, setting DOWN</font><br>
                          </div>
                        </div>
                        <div><br>
                        </div>
                        <div>With elastic computing, any unused nodes
                          are automatically removed
                          (by SuspendProgram=/usr/local/sbin/slurm_suspend.sh).
                          So nodes are *expected* to not respond once
                          they are removed, but they should not be
                          marked as DOWN. They should simply be set to
                          "idle". </div>
                      </div>
                      <div><br>
                      </div>
                      <div>To work around this issue, I am running the
                        following cron job.</div>
                      <div><br>
                      </div>
                      <div><font face="monospace">0 0 * * * scontrol
                          update node=slurm9-compute[1-30] state=resume</font><br>
                      </div>
                      <div><br>
                      </div>
                      <div>This "works" somewhat.. but our nodes go to
                        "DOWN" state so often that running this every
                        hour is not enough.</div>
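                      <div><br>
                      </div>
                      <div>(I also considered a more targeted variant
                        that only resumes nodes downed as "Not
                        responding", something like the untested
                        one-liner below, run from cron or by hand (the
                        percent signs would need escaping in a crontab
                        entry). It would still just be papering over the
                        problem, though.)</div>
                      <div><font face="monospace">sinfo -h -R -o '%E|%N' | awk -F'|' '/Not responding/{print $2}' | xargs -r -I{} scontrol update nodename={} state=resume</font><br>
                      </div>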
                      <div><br>
                      </div>
                      <div>Here is the full content of our slurm.conf</div>
                      <div><br>
                      </div>
                      <div><font face="monospace">root@slurm9:~# cat
                          /etc/slurm-llnl/slurm.conf <br>
                          ClusterName=slurm9<br>
                          ControlMachine=slurm9<br>
                          <br>
                          SlurmUser=slurm<br>
                          SlurmdUser=root<br>
                          SlurmctldPort=6817<br>
                          SlurmdPort=6818<br>
                          AuthType=auth/munge<br>
                          StateSaveLocation=/tmp<br>
                          SlurmdSpoolDir=/tmp/slurmd<br>
                          SwitchType=switch/none<br>
                          MpiDefault=none<br>
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid<br>
                          SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid<br>
                          ProctrackType=proctrack/pgid<br>
                          ReturnToService=1<br>
                          Prolog=/usr/local/sbin/slurm_prolog.sh<br>
                          <br>
                          #<br>
                          # TIMERS<br>
                          SlurmctldTimeout=300<br>
                          SlurmdTimeout=300<br>
                          #make slurm a little more tolerant here<br>
                          MessageTimeout=30<br>
                          TCPTimeout=15<br>
                          BatchStartTimeout=20<br>
                          GetEnvTimeout=20<br>
                          InactiveLimit=0<br>
                          MinJobAge=604800<br>
                          KillWait=30<br>
                          Waittime=0<br>
                          #<br>
                          # SCHEDULING<br>
                          SchedulerType=sched/backfill<br>
                          SelectType=select/cons_res<br>
                          SelectTypeParameters=CR_CPU_Memory<br>
                          #FastSchedule=0<br>
                          <br>
                          # LOGGING<br>
                          SlurmctldDebug=3<br>
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log<br>
                          SlurmdDebug=3<br>
                          SlurmdLogFile=/var/log/slurm-llnl/slurmd.log<br>
                          JobCompType=jobcomp/none<br>
                          <br>
                          # ACCOUNTING<br>
                          JobAcctGatherType=jobacct_gather/linux<br>
                          JobAcctGatherFrequency=30<br>
                          <br>
AccountingStorageType=accounting_storage/filetxt<br>
AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log<br>
                          <br>
                          #CLOUD CONFIGURATION<br>
                          PrivateData=cloud<br>
                          ResumeProgram=/usr/local/sbin/slurm_resume.sh<br>
SuspendProgram=/usr/local/sbin/slurm_suspend.sh<br>
                          ResumeRate=1 #number of nodes per minute that
                          can be created; 0 means no limit<br>
                          ResumeTimeout=900 #max time in seconds between
                          ResumeProgram running and when the node is
                          ready for use<br>
                          SuspendRate=1 #number of nodes per minute that
                          can be suspended/destroyed<br>
                          SuspendTime=600 #time in seconds before an
                          idle node is suspended<br>
                          SuspendTimeout=300 #time between running
                          SuspendProgram and the node being completely
                          down<br>
                          TreeWidth=30<br>
                          <br>
                          NodeName=slurm9-compute[1-15] State=CLOUD
                          CPUs=24 RealMemory=60388<br>
                          PartitionName=cloud LLN=YES
                          Nodes=slurm9-compute[1-15] Default=YES
                          MaxTime=48:00:00 State=UP Shared=YES</font><br>
                      </div>
                      <div><font face="monospace"><br>
                        </font></div>
                      <div><font face="arial, sans-serif">I appreciate
                          your assistance!</font></div>
                      <div><font face="arial, sans-serif"><br>
                        </font></div>
                      <div><font face="arial, sans-serif">Soichi Hayashi</font></div>
                      <div><font face="arial, sans-serif">Indiana
                          University</font></div>
                      <div><font face="arial, sans-serif"><br>
                        </font></div>
                      <div><font face="monospace"><br>
                        </font></div>
                    </div>
                  </blockquote>
                </div>
              </div>
            </blockquote>
          </div>
        </blockquote>
      </div>
    </blockquote>
  </body>
</html>