[slurm-users] Node can't run simple job when STATUS is up and STATE is idle
Brian Andrus
toomuchit at gmail.com
Mon Jan 20 21:41:39 UTC 2020
Check the slurmd log file on the node.
Ensure slurmd is still running. It's possible that the OOM killer or
something similar is killing slurmd.
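
A quick check on the compute node might look something like this
(assuming slurmd runs under systemd; the slurmd log lives wherever
SlurmdLogFile in slurm.conf points):

    # is slurmd still alive?
    systemctl status slurmd

    # recent slurmd messages plus any OOM-killer traces from the kernel
    journalctl -u slurmd --since "1 hour ago"
    dmesg -T | grep -iE 'oom|killed process'

    # Slurm's own view of why the node is flagged
    scontrol show node liqidos-dean-node1 | grep -i reason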
Brian Andrus
On 1/20/2020 1:12 PM, Dean Schulze wrote:
> If I restart slurmd the asterisk goes away. Then I can run the job
> once before the asterisk comes back and the node stays in comp*:
>
> [liqid at liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 1 idle liqidos-dean-node1
> [liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
> liqidos-dean-node1
> [liqid at liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 1 comp* liqidos-dean-node1
>
> I can get it back to idle* with scontrol:
>
> [liqid at liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update
> NodeName=liqidos-dean-node1 State=down Reason=none
> [liqid at liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update
> NodeName=liqidos-dean-node1 State=resume
> [liqid at liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 1 idle* liqidos-dean-node1
>
> I'm beginning to wonder if I got some bad code from GitHub.
>
>
> On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy <minibit at gmail.com> wrote:
>
> Hi,
>
> The * next to the idle status in sinfo means that the node is
> unreachable/not responding. Check the status of the slurmd on the
> node and check the connectivity from the slurmctld host to the
> compute node (telnet may be enough). You can also check the
> slurmctld logs for more information.
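>
> For example, something along these lines (this assumes the default
> SlurmdPort of 6818 and a SlurmctldLogFile set in slurm.conf; adjust
> to your configuration):
>
>     # confirm the slurmd port and the controller log location
>     scontrol show config | grep -iE 'slurmdport|slurmctldlogfile'
>
>     # from the slurmctld host, test connectivity to the compute node
>     telnet liqidos-dean-node1 6818
>
>     # then look for messages about the node in the slurmctld log
>     # (/var/log/slurmctld.log is only a common choice of path)
>     grep liqidos-dean-node1 /var/log/slurmctld.log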
>
> Regards,
> Carlos
>
> On Mon, 20 Jan 2020 at 21:04, Dean Schulze
> <dean.w.schulze at gmail.com> wrote:
>
> I've got a node running on CentOS 7.7, built from the recent
> 20.02.0pre1 code base. Its behavior is strange, to say the
> least.
>
> The controller was built from the same code base, but on
> Ubuntu 19.10. The controller reports the node's state with
> sinfo, but can't run a simple job with srun because it thinks
> the node isn't available, even when it is idle. (And squeue
> shows an empty queue.)
>
> On the controller:
> $ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 30 queued and waiting for resources
> ^Csrun: Job allocation 30 has been revoked
> srun: Force Terminated job 30
> $ sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 1 idle* liqidos-dean-node1
> $ squeue
> JOBID PARTITION USER ST TIME NODES
> NODELIST(REASON)
>
>
> When I try to run the simple job on the node I get:
>
> [liqid at liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 1 idle* liqidos-dean-node1
> [liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 27 queued and waiting for resources
> ^Csrun: Job allocation 27 has been revoked
> [liqid at liqidos-dean-node1 ~]$ squeue
> JOBID PARTITION USER ST TIME NODES
> NODELIST(REASON)
> [liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 28 queued and waiting for resources
> ^Csrun: Job allocation 28 has been revoked
> [liqid at liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 1 idle* liqidos-dean-node1
>
> Apparently Slurm thinks there are a bunch of jobs queued, but
> shows an empty queue. How do I get rid of these?
>
> If these zombie jobs aren't the problem, what else could be
> keeping this from running?
>
> Thanks.
>
> --
> --
> Carles Fenoy
>