[slurm-users] Node can't run simple job when STATUS is up and STATE is idle
Dean Schulze
dean.w.schulze at gmail.com
Mon Jan 20 20:01:31 UTC 2020
I've got a node running CentOS 7.7, built from the recent 20.02.0pre1
code base. Its behavior is strange, to say the least.
The controller was built from the same code base, but on Ubuntu 19.10. The
controller reports the node's state with sinfo, but it can't run a simple job
with srun because it thinks the node isn't available, even when it is
idle. (And squeue shows an empty queue.)
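For reference, a quick way to confirm that both ends really report the same
20.02.0pre1 build (just a sketch):
$ sinfo -V      # on the controller; prints the Slurm version
$ slurmd -V     # on the node; should report the same version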
On the controller:
$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 30 queued and waiting for resources
^Csrun: Job allocation 30 has been revoked
srun: Force Terminated job 30
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle* liqidos-dean-node1
$ squeue
JOBID PARTITION USER ST TIME NODES NODELIST(REASON)
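If it helps, this is the extra detail I know how to pull beyond sinfo (a
sketch; the node name is taken from the sinfo output above):
$ scontrol show node liqidos-dean-node1   # State and Reason fields for the node
$ sinfo -R                                # lists the reason for any down/drained nodes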
When I try to run the simple job on the node I get:
[liqid at liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle* liqidos-dean-node1
[liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 27 queued and waiting for resources
^Csrun: Job allocation 27 has been revoked
[liqid at liqidos-dean-node1 ~]$ squeue
JOBID PARTITION USER ST TIME NODES NODELIST(REASON)
[liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 28 queued and waiting for resources
^Csrun: Job allocation 28 has been revoked
[liqid at liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle* liqidos-dean-node1
Apparently Slurm thinks there are a bunch of jobs queued, yet squeue shows an
empty queue. How do I get rid of them?
If these zombie jobs aren't the problem, what else could be keeping this
from running?
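In case it matters, these are the commands I was planning to try next (a
sketch, using the node name and job IDs from the output above):
$ squeue -t all                    # show jobs in every state, not just pending/running
$ scontrol show job                # full detail for any jobs the controller still tracks
$ scancel 27 28 30                 # cancel the leftover job IDs if they still exist
$ scontrol update nodename=liqidos-dean-node1 state=resume   # clear a down/drained flag on the node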
Thanks.