[slurm-users] Node can't run simple job when STATUS is up and STATE is idle

Dean Schulze dean.w.schulze at gmail.com
Mon Jan 20 20:01:31 UTC 2020


I've got a node running on CentOS 7.7, built from the recent 20.02.0pre1
code base.  Its behavior is strange, to say the least.

The controller was built from the same code base, but on Ubuntu 19.10.  The
controller reports the node's state with sinfo, but can't run a simple job
with srun because it thinks the node isn't available, even when it is
idle.  (And squeue shows an empty queue.)

On the controller:
$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 30 queued and waiting for resources
^Csrun: Job allocation 30 has been revoked
srun: Force Terminated job 30
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* liqidos-dean-node1
$ squeue
             JOBID  PARTITION      USER  ST        TIME   NODES NODELIST(REASON)


When I try to run the simple job on the node I get:

[liqid at liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* liqidos-dean-node1
[liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 27 queued and waiting for resources
^Csrun: Job allocation 27 has been revoked
[liqid at liqidos-dean-node1 ~]$ squeue
             JOBID  PARTITION      USER  ST        TIME   NODES NODELIST(REASON)
[liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 28 queued and waiting for resources
^Csrun: Job allocation 28 has been revoked
[liqid at liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* liqidos-dean-node1

Apparently Slurm thinks there are a bunch of jobs queued, but squeue shows
an empty queue.  How do I get rid of these?

If these zombie jobs aren't the problem, what else could be keeping the
job from running?
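
One detail in the sinfo output above may be relevant: the state is printed
as "idle*" rather than "idle".  According to the sinfo man page, a trailing
"*" on a node state means the node is not responding.  Below is a minimal
Python sketch, not part of the original session, that picks such nodes out
of sinfo-style text.  The function name flagged_nodes and the assumption
that the columns are whitespace-separated with STATE and NODELIST headers
are mine.

```python
# Sketch: flag node states in sinfo-style output that end in "*".
# Assumption: whitespace-separated columns with a header row containing
# STATE and NODELIST, as in the output pasted in this message.

def flagged_nodes(sinfo_text):
    """Return (nodelist, state) pairs whose STATE ends in '*'."""
    lines = [ln for ln in sinfo_text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    state_col = header.index("STATE")
    nodes_col = header.index("NODELIST")
    flagged = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) <= max(state_col, nodes_col):
            continue  # skip short or malformed rows
        state = fields[state_col]
        if state.endswith("*"):
            flagged.append((fields[nodes_col], state))
    return flagged


sample = """\
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* liqidos-dean-node1
"""

print(flagged_nodes(sample))
```

For a node flagged this way, "scontrol show node liqidos-dean-node1" and
"sinfo -R" should show whatever reason the controller has recorded.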

Thanks.