<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Check the slurmd log file on the node.</p>
<p>Ensure slurmd is still running. It's possible that the OOM killer (or
something similar) is killing slurmd.</p>
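<p>Something like this should show it (adjust the log path to whatever
SlurmdLogFile is set to in your slurm.conf; the unit name assumes a
systemd install):</p>
<p>[liqid@liqidos-dean-node1 ~]$ systemctl status slurmd<br>
[liqid@liqidos-dean-node1 ~]$ sudo tail -n 100 /var/log/slurm/slurmd.log<br>
[liqid@liqidos-dean-node1 ~]$ sudo dmesg -T | grep -iE 'killed process|out of memory'<br>
</p>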
<p>Brian Andrus<br>
</p>
<div class="moz-cite-prefix">On 1/20/2020 1:12 PM, Dean Schulze
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CA+LiX6GZBfGV8hqPg_b6ATf2aO5Hp5gBg2dSW1VawyFhAfj4Og@mail.gmail.com">
<div dir="ltr">If I restart slurmd the asterisk goes away. Then I
can run the job once and the asterisk is back, and the node
remains in comp*:
<div><br>
</div>
<div>[liqid@liqidos-dean-node1 ~]$ sinfo<br>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST <br>
debug* up infinite 1 idle liqidos-dean-node1 <br>
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname<br>
liqidos-dean-node1<br>
[liqid@liqidos-dean-node1 ~]$ sinfo<br>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST <br>
debug* up infinite 1 comp* liqidos-dean-node1 <br>
</div>
<div><br>
</div>
<div>I can get it back to idle* with scontrol:</div>
<div><br>
</div>
<div>[liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol
update NodeName=liqidos-dean-node1 State=down Reason=none<br>
[liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol
update NodeName=liqidos-dean-node1 State=resume<br>
[liqid@liqidos-dean-node1 ~]$ sinfo<br>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST <br>
debug* up infinite 1 idle* liqidos-dean-node1 <br>
</div>
<div><br>
</div>
<div>I'm beginning to wonder if I got some bad code from GitHub.</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Jan 20, 2020 at 1:50
PM Carlos Fenoy <<a href="mailto:minibit@gmail.com">minibit@gmail.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="auto">Hi,</div>
</div>
<div dir="auto"><br>
</div>
<div dir="auto">The * next to the idle status in sinfo means
that the node is unreachable/not responding. Check the
status of the slurmd on the node and check the connectivity
from the slurmctld host to the compute node (telnet may be
enough). You can also check the slurmctld logs for more
information. </div>
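<div dir="auto"><br>
</div>
<div dir="auto">Something along these lines, assuming the default
SlurmdPort of 6818 and the slurmctld log at whatever SlurmctldLogFile
points to:</div>
<div dir="auto">$ scontrol show node liqidos-dean-node1<br>
$ telnet liqidos-dean-node1 6818<br>
$ tail -f /var/log/slurm/slurmctld.log<br>
</div>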
<div dir="auto"><br>
</div>
<div dir="auto">Regards,</div>
<div dir="auto">Carlos</div>
<div><br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, 20 Jan 2020 at
21:04, Dean Schulze <<a
href="mailto:dean.w.schulze@gmail.com" target="_blank"
moz-do-not-send="true">dean.w.schulze@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="ltr">I've got a node running on CentOS 7.7
build from the recent 20.02.0pre1 code base. It's
behavior is strange to say the least.
<div><br>
</div>
<div>The controller was built from the same code base, but on Ubuntu
19.10. The controller reports the node's state with sinfo, but it
can't run a simple job with srun because it thinks the node isn't
available, even when it is idle. (And squeue shows an empty queue.)</div>
<div><br>
</div>
<div>On the controller:</div>
<div>$ srun -N 1 hostname<br>
srun: Required node not available (down, drained or
reserved)<br>
srun: job 30 queued and waiting for resources<br>
^Csrun: Job allocation 30 has been revoked<br>
srun: Force Terminated job 30<br>
$ sinfo<br>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST <br>
debug* up infinite 1 idle*
liqidos-dean-node1 <br>
<div>$ squeue<br>
JOBID PARTITION USER ST
TIME NODES NODELIST(REASON) <br>
</div>
<div><br>
</div>
<div><br>
</div>
<div>When I try to run the simple job on the node I
get:</div>
<div><br>
</div>
<div>[liqid@liqidos-dean-node1 ~]$ sinfo<br>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
<br>
debug* up infinite 1 idle*
liqidos-dean-node1 <br>
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname<br>
srun: Required node not available (down, drained
or reserved)<br>
srun: job 27 queued and waiting for resources<br>
^Csrun: Job allocation 27 has been revoked<br>
[liqid@liqidos-dean-node1 ~]$ squeue<br>
JOBID PARTITION USER ST
TIME NODES NODELIST(REASON) <br>
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname<br>
srun: Required node not available (down, drained
or reserved)<br>
srun: job 28 queued and waiting for resources<br>
^Csrun: Job allocation 28 has been revoked<br>
[liqid@liqidos-dean-node1 ~]$ sinfo<br>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
<br>
debug* up infinite 1 idle*
liqidos-dean-node1 <br>
</div>
</div>
<div><br>
</div>
<div>Apparently Slurm thinks there are a bunch of jobs queued, but
shows an empty queue. How do I get rid of these?</div>
<div><br>
</div>
<div>If these zombie jobs aren't the problem, what else could be
keeping this from running?</div>
<div><br>
</div>
<div>Thanks.</div>
</div>
</blockquote>
</div>
</div>
-- <br>
<div dir="ltr">--<br>
Carles Fenoy<br>
</div>
</blockquote>
</div>
</blockquote>
</body>
</html>