<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>Soichi,</p>
    <p>(I added a subject)</p>
    <p>You want to run 'sinfo -R' to find out the reason they are going
      down.</p>
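    <p>For example, to see the reason recorded for each down node, and
      then the full state of one of them:</p>
    <p><font face="monospace"># reason each node was set down/drained<br>
        sinfo -R<br>
        # full state (State=, Reason=) of a single node, e.g.:<br>
        scontrol show node slurm9-compute6</font></p>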
    <p>You may also want to bump up your logging to get more info. It
      looks like they are getting stuck in a 'completing' state. You
      will want to look at your slurmd logs on the nodes themselves for
      further info as to why.</p>
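    <p>For example (just a sketch; pick whatever level suits you and
      restart the daemons after changing slurm.conf):</p>
    <p><font face="monospace"># slurm.conf on the compute nodes<br>
        SlurmdDebug=debug2<br>
        # slurmctld verbosity can also be raised on the fly:<br>
        scontrol setdebug debug2</font></p>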
    <div class="moz-cite-prefix">Brian Andrus</div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">On 7/30/2021 11:21 AM, Soichi Hayashi
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAGLeeFhYS2BETBVP2UD8KrhKtKSBxTAQzjUpS_sy8ggObHtqYw@mail.gmail.com">
      <div dir="ltr">Hello. I need help troubleshooting our Slurm
        cluster. 
        <div><br>
        </div>
        <div>I am running slurm-wlm 17.11.2 on Ubuntu 20 on a public
          cloud infrastructure (Jetstream) using the elastic computing
          mechanism (<a
            href="https://slurm.schedmd.com/elastic_computing.html">https://slurm.schedmd.com/elastic_computing.html</a>).
          Our cluster works for the most part, but for some reason a
          few of our nodes constantly go into the "down" state.
          <div><br>
          </div>
          <div><font face="monospace">PARTITION AVAIL  TIMELIMIT  
              JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE
              NODELIST<br>
              cloud*       up 2-00:00:00 1-infinite   no    YES:4      
               all     10       idle~ slurm9-compute[1-5,10,12-15]<br>
              cloud*       up 2-00:00:00 1-infinite   no    YES:4      
               all      5        down slurm9-compute[6-9,11]</font><br>
            <div><br>
            </div>
            <div>The only entries I see in the slurm log are these:</div>
            <div><br>
            </div>
            <div><font face="monospace">[2021-07-30T15:10:55.889]
                Invalid node state transition requested for node
                slurm9-compute6 from=COMPLETING to=RESUME<br>
                [2021-07-30T15:21:37.339] Invalid node state transition
                requested for node slurm9-compute6 from=COMPLETING*
                to=RESUME<br>
                [2021-07-30T15:27:30.039] update_node: node
                slurm9-compute6 reason set to: completing<br>
                [2021-07-30T15:27:30.040] update_node: node
                slurm9-compute6 state set to DOWN<br>
                [2021-07-30T15:27:40.830] update_node: node
                slurm9-compute6 state set to IDLE</font><br>
            </div>
            <div>..</div>
            <div><font face="monospace">[2021-07-30T15:34:20.628] error: Nodes
                slurm9-compute[6-9,11] not responding, setting DOWN</font><br>
            </div>
          </div>
          <div><br>
          </div>
          <div>With elastic computing, any unused nodes are
            automatically removed
            (by SuspendProgram=/usr/local/sbin/slurm_suspend.sh). So
            nodes are *expected* not to respond once they are removed,
            but they should not be marked as DOWN; they should simply
            be set back to "idle". </div>
        </div>
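        <div><br>
        </div>
        <div>For reference, SuspendProgram is simply handed the
          hostlist of nodes to power down as its first argument. A
          minimal sketch of such a script (illustrative only, not our
          actual slurm_suspend.sh, which deletes the Jetstream
          instances) would be:</div>
        <div><br>
        </div>
        <div><font face="monospace">#!/bin/bash<br>
            # illustrative sketch of a SuspendProgram<br>
            # Slurm passes the nodes to power down as a hostlist in $1<br>
            for host in $(scontrol show hostnames "$1"); do<br>
                echo "$(date) suspending $host" &gt;&gt; /var/log/slurm-llnl/power_save.log<br>
                # cloud API call to delete/stop the instance goes here<br>
            done</font><br>
        </div>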
        <div><br>
        </div>
        <div>To work around this issue, I am running the following cron
          job.</div>
        <div><br>
        </div>
        <div><font face="monospace">0 0 * * * scontrol update
            node=slurm9-compute[1-30] state=resume</font><br>
        </div>
        <div><br>
        </div>
        <div>This "works" to some extent, but our nodes go into the
          DOWN state so often that running this every hour is not
          enough.</div>
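        <div><br>
        </div>
        <div>(For reference, the crontab entry as written above fires
          once a day at midnight; an hourly schedule would be:)</div>
        <div><font face="monospace">0 * * * * scontrol update
            node=slurm9-compute[1-30] state=resume</font><br>
        </div>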
        <div><br>
        </div>
        <div>Here is the full content of our slurm.conf</div>
        <div><br>
        </div>
        <div><font face="monospace">root@slurm9:~# cat
            /etc/slurm-llnl/slurm.conf <br>
            ClusterName=slurm9<br>
            ControlMachine=slurm9<br>
            <br>
            SlurmUser=slurm<br>
            SlurmdUser=root<br>
            SlurmctldPort=6817<br>
            SlurmdPort=6818<br>
            AuthType=auth/munge<br>
            StateSaveLocation=/tmp<br>
            SlurmdSpoolDir=/tmp/slurmd<br>
            SwitchType=switch/none<br>
            MpiDefault=none<br>
            SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid<br>
            SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid<br>
            ProctrackType=proctrack/pgid<br>
            ReturnToService=1<br>
            Prolog=/usr/local/sbin/slurm_prolog.sh<br>
            <br>
            #<br>
            # TIMERS<br>
            SlurmctldTimeout=300<br>
            SlurmdTimeout=300<br>
            #make slurm a little more tolerant here<br>
            MessageTimeout=30<br>
            TCPTimeout=15<br>
            BatchStartTimeout=20<br>
            GetEnvTimeout=20<br>
            InactiveLimit=0<br>
            MinJobAge=604800<br>
            KillWait=30<br>
            Waittime=0<br>
            #<br>
            # SCHEDULING<br>
            SchedulerType=sched/backfill<br>
            SelectType=select/cons_res<br>
            SelectTypeParameters=CR_CPU_Memory<br>
            #FastSchedule=0<br>
            <br>
            # LOGGING<br>
            SlurmctldDebug=3<br>
            SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log<br>
            SlurmdDebug=3<br>
            SlurmdLogFile=/var/log/slurm-llnl/slurmd.log<br>
            JobCompType=jobcomp/none<br>
            <br>
            # ACCOUNTING<br>
            JobAcctGatherType=jobacct_gather/linux<br>
            JobAcctGatherFrequency=30<br>
            <br>
            AccountingStorageType=accounting_storage/filetxt<br>
            AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log<br>
            <br>
            #CLOUD CONFIGURATION<br>
            PrivateData=cloud<br>
            ResumeProgram=/usr/local/sbin/slurm_resume.sh<br>
            SuspendProgram=/usr/local/sbin/slurm_suspend.sh<br>
            ResumeRate=1 #number of nodes per minute that can be
            created; 0 means no limit<br>
            ResumeTimeout=900 #max time in seconds between ResumeProgram
            running and when the node is ready for use<br>
            SuspendRate=1 #number of nodes per minute that can be
            suspended/destroyed<br>
            SuspendTime=600 #time in seconds before an idle node is
            suspended<br>
            SuspendTimeout=300 #time between running SuspendProgram and
            the node being completely down<br>
            TreeWidth=30<br>
            <br>
            NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24
            RealMemory=60388<br>
            PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15]
            Default=YES MaxTime=48:00:00 State=UP Shared=YES</font><br>
        </div>
        <div><font face="monospace"><br>
          </font></div>
        <div><font face="arial, sans-serif">I appreciate your
            assistance!</font></div>
        <div><font face="arial, sans-serif"><br>
          </font></div>
        <div><font face="arial, sans-serif">Soichi Hayashi</font></div>
        <div><font face="arial, sans-serif">Indiana University</font></div>
        <div><font face="arial, sans-serif"><br>
          </font></div>
        <div><font face="monospace"><br>
          </font></div>
      </div>
    </blockquote>
  </body>
</html>