[slurm-users] Node can't run simple job when STATUS is up and STATE is idle

Ryan Novosielski novosirj at rutgers.edu
Tue Jan 21 04:18:39 UTC 2020


The node is not getting the status from itself, it’s querying the slurmctld to ask for its status.
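This is easy to confirm from either host (a minimal sketch; the node name is taken from the thread below):

```shell
# Verify this host can reach slurmctld at all
scontrol ping

# Ask slurmctld for its view of the node; the Reason field explains
# flags such as "*" (not responding) next to the state in sinfo
scontrol show node liqidos-dean-node1
```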

--
____
|| \\UTGERS,  	 |---------------------------*O*---------------------------
||_// the State	 |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ	 | Office of Advanced Research Computing - MSB C630, Newark
     `'

> On Jan 20, 2020, at 3:56 PM, Dean Schulze <dean.w.schulze at gmail.com> wrote:
> 
> If I run sinfo on the node itself it shows an asterisk.  How can the node be unreachable from itself?
> 
> On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy <minibit at gmail.com> wrote:
> Hi,
> 
> The * next to the idle status in sinfo means that the node is unreachable/not responding. Check the status of the slurmd on the node and check the connectivity from the slurmctld host to the compute node (telnet may be enough). You can also check the slurmctld logs for more information. 
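Those checks might look something like this (a sketch; 6818 is the default SlurmdPort, and the log path depends on your slurm.conf):

```shell
# On the compute node: is slurmd actually running?
systemctl status slurmd

# From the slurmctld host: can we reach slurmd on its port?
# 6818 is the default; confirm the configured value first:
scontrol show config | grep -i SlurmdPort
nc -zv liqidos-dean-node1 6818

# Check the controller log for communication errors
# (actual path comes from SlurmctldLogFile in slurm.conf)
grep -i liqidos-dean-node1 /var/log/slurm/slurmctld.log | tail
```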
> 
> Regards,
> Carlos
> 
> On Mon, 20 Jan 2020 at 21:04, Dean Schulze <dean.w.schulze at gmail.com> wrote:
> I've got a node running on CentOS 7.7, built from the recent 20.02.0pre1 code base.  Its behavior is strange, to say the least.
> 
> The controller was built from the same code base, but on Ubuntu 19.10.  The controller reports the node's state with sinfo, but can't run a simple job with srun because it thinks the node isn't available, even when it is idle.  (And squeue shows an empty queue.)
> 
> On the controller:
> $ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 30 queued and waiting for resources
> ^Csrun: Job allocation 30 has been revoked
> srun: Force Terminated job 30
> $ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
> debug*       up   infinite      1  idle* liqidos-dean-node1 
> $ squeue
>              JOBID  PARTITION      USER  ST        TIME   NODES NODELIST(REASON) 
> 
> 
> When I try to run the simple job on the node I get:
> 
> [liqid at liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
> debug*       up   infinite      1  idle* liqidos-dean-node1 
> [liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 27 queued and waiting for resources
> ^Csrun: Job allocation 27 has been revoked
> [liqid at liqidos-dean-node1 ~]$ squeue
>              JOBID  PARTITION      USER  ST        TIME   NODES NODELIST(REASON) 
> [liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 28 queued and waiting for resources
> ^Csrun: Job allocation 28 has been revoked
> [liqid at liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST 
> debug*       up   infinite      1  idle* liqidos-dean-node1 
> 
> Apparently Slurm thinks there are a bunch of jobs queued, but squeue shows an empty queue.  How do I get rid of these?
> 
> If these zombie jobs aren't the problem, what else could be keeping this from running?
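For reference, leftover jobs and a stuck not-responding flag can usually be inspected and cleared with the standard scancel/scontrol commands (a sketch; the RESUME step only helps once slurmd is actually reachable):

```shell
# List all jobs the controller still knows about, including
# recently completed/cancelled ones
squeue --all

# Cancel a specific leftover job by ID
scancel 30

# Once slurmd on the node is confirmed reachable, clear the
# node's down/not-responding state so it can accept work again
scontrol update NodeName=liqidos-dean-node1 State=RESUME
```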
> 
> Thanks.
> -- 
> --
> Carles Fenoy


