[slurm-users] Node can't run simple job when STATUS is up and STATE is idle
Brian Andrus
toomuchit at gmail.com
Mon Jan 20 21:41:39 UTC 2020
Check the slurmd log file on the node.
Ensure slurmd is still running. It's possible that the OOM killer or
something similar is killing slurmd.
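
A quick check on the compute node might look something like this
(assuming slurmd runs under systemd; the slurmd log lives wherever
SlurmdLogFile in slurm.conf points):

    # is slurmd still alive?
    systemctl status slurmd

    # recent slurmd messages plus any OOM-killer traces from the kernel
    journalctl -u slurmd --since "1 hour ago"
    dmesg -T | grep -iE 'oom|killed process'

    # Slurm's own view of why the node is flagged
    scontrol show node liqidos-dean-node1 | grep -i reason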
Brian Andrus
On 1/20/2020 1:12 PM, Dean Schulze wrote:
> If I restart slurmd the asterisk goes away. Then I can run the job
> once before the asterisk comes back and the node stays in comp*:
>
> [liqid at liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 1 idle liqidos-dean-node1
> [liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
> liqidos-dean-node1
> [liqid at liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 1 comp* liqidos-dean-node1
>
> I can get it back to idle* with scontrol:
>
> [liqid at liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update
> NodeName=liqidos-dean-node1 State=down Reason=none
> [liqid at liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update
> NodeName=liqidos-dean-node1 State=resume
> [liqid at liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 1 idle* liqidos-dean-node1
>
> I'm beginning to wonder if I got some bad code from GitHub.
>
>
> On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy <minibit at gmail.com> wrote:
>
> Hi,
>
> The * next to the idle status in sinfo means that the node is
> unreachable/not responding. Check the status of the slurmd on the
> node and check the connectivity from the slurmctld host to the
> compute node (telnet may be enough). You can also check the
> slurmctld logs for more information.
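>
> For example, something along these lines (this assumes the default
> SlurmdPort of 6818 and a SlurmctldLogFile set in slurm.conf; adjust
> to your configuration):
>
>     # confirm the slurmd port and the controller log location
>     scontrol show config | grep -iE 'slurmdport|slurmctldlogfile'
>
>     # from the slurmctld host, test connectivity to the compute node
>     telnet liqidos-dean-node1 6818
>
>     # then look for messages about the node in the slurmctld log
>     # (/var/log/slurmctld.log is only a common choice of path)
>     grep liqidos-dean-node1 /var/log/slurmctld.log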
>
> Regards,
> Carlos
>
> On Mon, 20 Jan 2020 at 21:04, Dean Schulze
> <dean.w.schulze at gmail.com> wrote:
>
> I've got a node running on CentOS 7.7, built from the recent
> 20.02.0pre1 code base. Its behavior is strange, to say the
> least.
>
> The controller was built from the same code base, but on
> Ubuntu 19.10. The controller reports the node's state with
> sinfo, but can't run a simple job with srun because it thinks
> the node isn't available, even when it is idle. (And squeue
> shows an empty queue.)
>
> On the controller:
> $ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 30 queued and waiting for resources
> ^Csrun: Job allocation 30 has been revoked
> srun: Force Terminated job 30
> $ sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 1 idle* liqidos-dean-node1
> $ squeue
> JOBID PARTITION USER ST TIME NODES
> NODELIST(REASON)
>
>
> When I try to run the simple job on the node I get:
>
> [liqid at liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 1 idle* liqidos-dean-node1
> [liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 27 queued and waiting for resources
> ^Csrun: Job allocation 27 has been revoked
> [liqid at liqidos-dean-node1 ~]$ squeue
> JOBID PARTITION USER ST TIME NODES
> NODELIST(REASON)
> [liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 28 queued and waiting for resources
> ^Csrun: Job allocation 28 has been revoked
> [liqid at liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 1 idle* liqidos-dean-node1
>
> Apparently Slurm thinks there are a bunch of jobs queued, but
> shows an empty queue. How do I get rid of these?
>
> If these zombie jobs aren't the problem, what else could be
> keeping this from running?
>
> Thanks.
>
> --
> --
> Carles Fenoy
>