<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>Oh, also ensure that DNS is working properly on the node. It
      could be that it isn't able to map the master's hostname to its
      IP address.</p>
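    <p>A quick sanity check, run on the compute node itself. The
      controller name here is a placeholder; substitute the hostname
      from the SlurmctldHost line of your slurm.conf:</p>

```shell
# Sketch of a DNS check to run on the compute node. CONTROLLER is a
# placeholder -- use the hostname from slurm.conf's SlurmctldHost line.
CONTROLLER=${CONTROLLER:-slurm-controller}

# Forward lookup: can this node map the controller's name to an IP?
if getent hosts "$CONTROLLER" >/dev/null; then
    echo "forward lookup OK for $CONTROLLER"
else
    echo "forward lookup FAILED for $CONTROLLER"
fi

# slurmd resolves the controller from slurm.conf, so confirm what it
# actually sees (ignore quietly if the file lives elsewhere):
grep -i '^SlurmctldHost' /etc/slurm/slurm.conf 2>/dev/null || true
```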
    <p>Brian Andrus</p>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 6/4/2021 9:31 AM, Herc Silverstein
      wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:1ceaf369-bdf0-fcf5-322a-8016c2b0a991@schrodinger.com">
      <div class="moz-text-flowed" style="font-family: -moz-fixed;
        font-size: 13px;" lang="x-unicode">Hi, <br>
        <br>
        The slurmctld.log shows (for this node): <br>
        <br>
        ... <br>
        <br>
        [2021-05-25T00:12:27.481] sched: Allocate JobId=3402729
        NodeList=gpu-t4-4x-ondemand-44 #CPUs=1
        Partition=gpu-t4-4x-ondemand <br>
        [2021-05-25T00:12:27.482] sched: Allocate JobId=3402730
        NodeList=gpu-t4-4x-ondemand-44 #CPUs=1
        Partition=gpu-t4-4x-ondemand <br>
        [2021-05-25T00:12:27.482] sched: Allocate JobId=3402731
        NodeList=gpu-t4-4x-ondemand-44 #CPUs=1
        Partition=gpu-t4-4x-ondemand <br>
        [2021-05-25T00:15:29.043] error: Nodes gpu-t4-4x-ondemand-44 not
        responding <br>
        <br>
        <br>
        sinfo -R initially doesn't show it as problematic, though I do
        see it go into: <br>
        <br>
        gpu-t4-4x-ondemand                 up infinite      1  comp*
        gpu-t4-4x-ondemand-44 <br>
        <br>
        However, the node where slurmctld is running knows about it: <br>
        <br>
         host gpu-t4-4x-ondemand-44 <br>
        gpu-t4-4x-ondemand-44.virtual-cluster.local has address
        10.4.64.11 <br>
        <br>
        and I can log in to the node: <br>
        <br>
        # systemctl status slurmd <br>
        ● slurmd.service - Slurm node daemon <br>
           Loaded: loaded (/usr/lib/systemd/system/slurmd.service;
        disabled; vendor preset: disabled) <br>
           Active: active (running) since Tue 2021-05-25 00:12:24 UTC;
        48s ago <br>
          Process: 1874 ExecStart=/opt/slurm/sbin/slurmd $SLURMD_OPTIONS
        (code=exited, status=0/SUCCESS) <br>
         Main PID: 1876 (slurmd) <br>
            Tasks: 1 <br>
           Memory: 11.6M <br>
           CGroup: /system.slice/slurmd.service <br>
                   └─1876 /opt/slurm/sbin/slurmd -f
        /etc/slurm/slurm.conf <br>
        <br>
        May 25 00:12:23 gpu-t4-4x-ondemand-44.virtual-cluster.local
        systemd[1]: Starting Slurm node daemon... <br>
        May 25 00:12:24 gpu-t4-4x-ondemand-44.virtual-cluster.local
        systemd[1]: Can't open PID file /var/run/slurmd.pid (yet?)...ory
        <br>
        May 25 00:12:24 gpu-t4-4x-ondemand-44.virtual-cluster.local
        systemd[1]: Started Slurm node daemon. <br>
        Hint: Some lines were ellipsized, use -l to show in full. <br>
        <br>
        later: <br>
        <br>
        sinfo: <br>
        <br>
        gpu-t4-4x-ondemand                 up infinite      1  idle*
        gpu-t4-4x-ondemand-44 <br>
        <br>
        root@service(eigen2):log# sinfo -R <br>
        REASON               USER      TIMESTAMP           NODELIST <br>
        Not responding       slurm     2021-05-25T00:45:40
        gpu-t4-4x-ondemand-44 <br>
        <br>
        and slurmctld.log: <br>
        <br>
        [2021-05-25T00:15:29.043] error: Nodes gpu-t4-4x-ondemand-44 not
        responding <br>
        [2021-05-25T00:19:16.397] error: Nodes gpu-t4-4x-ondemand-44 not
        responding, setting DOWN <br>
        [2021-05-25T00:20:02.092] powering down node
        gpu-t4-4x-ondemand-44 <br>
        [2021-05-25T00:20:08.438] error: Nodes gpu-t4-4x-ondemand-44 not
        responding, setting DOWN <br>
        [2021-05-25T00:25:02.931] powering down node
        gpu-t4-4x-ondemand-44 <br>
        [2021-05-25T00:25:04.903] error: Nodes gpu-t4-4x-ondemand-44 not
        responding, setting DOWN <br>
        [2021-05-25T00:30:01.247] powering down node
        gpu-t4-4x-ondemand-44 <br>
        [2021-05-25T00:31:21.479] error: Nodes gpu-t4-4x-ondemand-44 not
        responding, setting DOWN <br>
        [2021-05-25T00:35:01.359] powering down node
        gpu-t4-4x-ondemand-44 <br>
        [2021-05-25T00:35:41.756] error: Nodes gpu-t4-4x-ondemand-44 not
        responding, setting DOWN <br>
        [2021-05-25T00:40:01.671] powering down node
        gpu-t4-4x-ondemand-44 <br>
        [2021-05-25T00:40:41.225] error: Nodes gpu-t4-4x-ondemand-44 not
        responding, setting DOWN <br>
        [2021-05-25T00:45:01.430] powering down node
        gpu-t4-4x-ondemand-44 <br>
        [2021-05-25T00:45:40.071] error: Nodes gpu-t4-4x-ondemand-44 not
        responding, setting DOWN <br>
      </div>
      <p><br>
        This makes sense given what slurmctld thinks the node's state
        is.  However, it's unclear why it considers the node
        non-responsive, given that slurmd is running and the node can
        be logged into.  <br>
      </p>
      <p> Herc</p>
      <p><br>
      </p>
    </blockquote>
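    <p>One more thing worth checking for the log quoted above: slurmd
      is running, yet slurmctld still reports "not responding", which
      can happen when the controller cannot reach slurmd's port (6818
      by default; a firewall or security group in the way is a common
      cause). A rough sketch, run from the controller node, using the
      node name from the log; confirm your actual port with
      <tt>scontrol show config | grep SlurmdPort</tt>:</p>

```shell
# Sketch: verify the controller can reach slurmd on the compute node.
# 6818 is Slurm's default SlurmdPort; NODE/PORT here are placeholders.
NODE=${NODE:-gpu-t4-4x-ondemand-44}
PORT=${PORT:-6818}

# Resolve the node name the way slurmctld would.
getent hosts "$NODE" || echo "controller cannot resolve $NODE"

# Probe the slurmd port (bash's /dev/tcp works even where nc is absent).
if timeout 3 bash -c "echo > /dev/tcp/$NODE/$PORT" 2>/dev/null; then
    echo "slurmd port $PORT on $NODE is reachable"
else
    echo "slurmd port $PORT on $NODE is NOT reachable"
fi
```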
  </body>
</html>