<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>Soichi,</p>
    <p>(I added a subject)</p>
    <p>You want to run 'sinfo -R' to find out the reason they are going
      down.</p>
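    <p>For example, to see the reason recorded for each down node, and
      then the full state of one of them:</p>
    <p><font face="monospace"># reason each node was set down/drained<br>
        sinfo -R<br>
        # full state (State=, Reason=) of a single node, e.g.:<br>
        scontrol show node slurm9-compute6</font></p>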
    <p>You may also want to bump up your logging to get more info. It
      looks like they are getting stuck in a 'completing' state. You
      will want to look at your slurmd logs on the nodes themselves for
      further info as to why.</p>
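    <p>For example (just a sketch; pick whatever level suits you and
      restart the daemons after changing slurm.conf):</p>
    <p><font face="monospace"># slurm.conf on the compute nodes<br>
        SlurmdDebug=debug2<br>
        # slurmctld verbosity can also be raised on the fly:<br>
        scontrol setdebug debug2</font></p>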
    <div class="moz-cite-prefix">Brian Andrus</div>
    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">On 7/30/2021 11:21 AM, Soichi Hayashi
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAGLeeFhYS2BETBVP2UD8KrhKtKSBxTAQzjUpS_sy8ggObHtqYw@mail.gmail.com">
      <div dir="ltr">Hello. I need help troubleshooting our Slurm
        cluster. 
        <div><br>
        </div>
        <div>I am running slurm-wlm 17.11.2 on Ubuntu 20 on a public
          cloud infrastructure (Jetstream) using the elastic computing
          mechanism (<a
            href="https://slurm.schedmd.com/elastic_computing.html">https://slurm.schedmd.com/elastic_computing.html</a>).
          Our cluster works for the most part, but for some reason a
          few of our nodes constantly go into the "down" state.
          <div><br>
          </div>
          <div><font face="monospace">PARTITION AVAIL  TIMELIMIT  
              JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE
              NODELIST<br>
              cloud*       up 2-00:00:00 1-infinite   no    YES:4      
               all     10       idle~ slurm9-compute[1-5,10,12-15]<br>
              cloud*       up 2-00:00:00 1-infinite   no    YES:4      
               all      5        down slurm9-compute[6-9,11]</font><br>
            <div><br>
            </div>
            <div>The only entries I see in the slurm log are these:</div>
            <div><br>
            </div>
            <div><font face="monospace">[2021-07-30T15:10:55.889]
                Invalid node state transition requested for node
                slurm9-compute6 from=COMPLETING to=RESUME<br>
                [2021-07-30T15:21:37.339] Invalid node state transition
                requested for node slurm9-compute6 from=COMPLETING*
                to=RESUME<br>
                [2021-07-30T15:27:30.039] update_node: node
                slurm9-compute6 reason set to: completing<br>
                [2021-07-30T15:27:30.040] update_node: node
                slurm9-compute6 state set to DOWN<br>
                [2021-07-30T15:27:40.830] update_node: node
                slurm9-compute6 state set to IDLE</font><br>
            </div>
            <div>..</div>
            <div><font face="monospace">[2021-07-30T15:34:20.628] error: Nodes
                slurm9-compute[6-9,11] not responding, setting DOWN</font><br>
            </div>
          </div>
          <div><br>
          </div>
          <div>With elastic computing, any unused nodes are
            automatically removed
            (by SuspendProgram=/usr/local/sbin/slurm_suspend.sh). So
            nodes are *expected* not to respond once they are removed,
            but they should not be marked as DOWN; they should simply
            be set back to "idle". </div>
        </div>
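        <div><br>
        </div>
        <div>For reference, SuspendProgram is simply handed the
          hostlist of nodes to power down as its first argument. A
          minimal sketch of such a script (illustrative only, not our
          actual slurm_suspend.sh, which deletes the Jetstream
          instances) would be:</div>
        <div><br>
        </div>
        <div><font face="monospace">#!/bin/bash<br>
            # illustrative sketch of a SuspendProgram<br>
            # Slurm passes the nodes to power down as a hostlist in $1<br>
            for host in $(scontrol show hostnames "$1"); do<br>
                echo "$(date) suspending $host" &gt;&gt; /var/log/slurm-llnl/power_save.log<br>
                # cloud API call to delete/stop the instance goes here<br>
            done</font><br>
        </div>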
        <div><br>
        </div>
        <div>To work around this issue, I am running the following cron
          job.</div>
        <div><br>
        </div>
        <div><font face="monospace">0 0 * * * scontrol update
            node=slurm9-compute[1-30] state=resume</font><br>
        </div>
        <div><br>
        </div>
        <div>This "works" to some extent, but our nodes go into the
          DOWN state so often that running this every hour is not
          enough.</div>
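        <div><br>
        </div>
        <div>(For reference, the crontab entry as written above fires
          once a day at midnight; an hourly schedule would be:)</div>
        <div><font face="monospace">0 * * * * scontrol update
            node=slurm9-compute[1-30] state=resume</font><br>
        </div>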
        <div><br>
        </div>
        <div>Here is the full content of our slurm.conf</div>
        <div><br>
        </div>
        <div><font face="monospace">root@slurm9:~# cat
            /etc/slurm-llnl/slurm.conf <br>
            ClusterName=slurm9<br>
            ControlMachine=slurm9<br>
            <br>
            SlurmUser=slurm<br>
            SlurmdUser=root<br>
            SlurmctldPort=6817<br>
            SlurmdPort=6818<br>
            AuthType=auth/munge<br>
            StateSaveLocation=/tmp<br>
            SlurmdSpoolDir=/tmp/slurmd<br>
            SwitchType=switch/none<br>
            MpiDefault=none<br>
            SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid<br>
            SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid<br>
            ProctrackType=proctrack/pgid<br>
            ReturnToService=1<br>
            Prolog=/usr/local/sbin/slurm_prolog.sh<br>
            <br>
            #<br>
            # TIMERS<br>
            SlurmctldTimeout=300<br>
            SlurmdTimeout=300<br>
            #make slurm a little more tolerant here<br>
            MessageTimeout=30<br>
            TCPTimeout=15<br>
            BatchStartTimeout=20<br>
            GetEnvTimeout=20<br>
            InactiveLimit=0<br>
            MinJobAge=604800<br>
            KillWait=30<br>
            Waittime=0<br>
            #<br>
            # SCHEDULING<br>
            SchedulerType=sched/backfill<br>
            SelectType=select/cons_res<br>
            SelectTypeParameters=CR_CPU_Memory<br>
            #FastSchedule=0<br>
            <br>
            # LOGGING<br>
            SlurmctldDebug=3<br>
            SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log<br>
            SlurmdDebug=3<br>
            SlurmdLogFile=/var/log/slurm-llnl/slurmd.log<br>
            JobCompType=jobcomp/none<br>
            <br>
            # ACCOUNTING<br>
            JobAcctGatherType=jobacct_gather/linux<br>
            JobAcctGatherFrequency=30<br>
            <br>
            AccountingStorageType=accounting_storage/filetxt<br>
            AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log<br>
            <br>
            #CLOUD CONFIGURATION<br>
            PrivateData=cloud<br>
            ResumeProgram=/usr/local/sbin/slurm_resume.sh<br>
            SuspendProgram=/usr/local/sbin/slurm_suspend.sh<br>
            ResumeRate=1 #number of nodes per minute that can be
            created; 0 means no limit<br>
            ResumeTimeout=900 #max time in seconds between ResumeProgram
            running and when the node is ready for use<br>
            SuspendRate=1 #number of nodes per minute that can be
            suspended/destroyed<br>
            SuspendTime=600 #time in seconds before an idle node is
            suspended<br>
            SuspendTimeout=300 #time between running SuspendProgram and
            the node being completely down<br>
            TreeWidth=30<br>
            <br>
            NodeName=slurm9-compute[1-15] State=CLOUD CPUs=24
            RealMemory=60388<br>
            PartitionName=cloud LLN=YES Nodes=slurm9-compute[1-15]
            Default=YES MaxTime=48:00:00 State=UP Shared=YES</font><br>
        </div>
        <div><font face="monospace"><br>
          </font></div>
        <div><font face="arial, sans-serif">I appreciate your
            assistance!</font></div>
        <div><font face="arial, sans-serif"><br>
          </font></div>
        <div><font face="arial, sans-serif">Soichi Hayashi</font></div>
        <div><font face="arial, sans-serif">Indiana University</font></div>
        <div><font face="arial, sans-serif"><br>
          </font></div>
        <div><font face="monospace"><br>
          </font></div>
      </div>
    </blockquote>
  </body>
</html>