[slurm-users] Node can't run simple job when STATUS is up and STATE is idle

Dean Schulze dean.w.schulze at gmail.com
Mon Jan 20 20:56:30 UTC 2020


If I run sinfo on the node itself, it still shows an asterisk.  How can the
node be unreachable from itself?
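[Editorial note: sinfo always reports the state recorded by slurmctld, so running it on the compute node itself does not test the local slurmd; the `*` suffix on `idle` only means the controller has not received a recent heartbeat from that node's slurmd. A minimal diagnostic sketch, assuming systemd service names and the default SlurmdPort of 6818 (check slurm.conf if your site differs); the hostname is taken from the sinfo output below:]

```shell
# On the compute node: is slurmd actually running?
systemctl status slurmd          # or: ps aux | grep slurmd
# Run slurmd in the foreground with verbose logging to see why it
# cannot register with the controller:
sudo slurmd -D -vvv

# On the controller: can slurmctld reach the node's slurmd port?
nc -zv liqidos-dean-node1 6818

# Ask the controller why it considers the node non-responding:
scontrol show node liqidos-dean-node1 | grep -i reason
```

[A frequent cause when the controller and node run different distributions is a mismatched munge key, clock skew, or a hostname that resolves differently on each side; the slurmctld log (SlurmctldLogFile in slurm.conf) usually names the reason.]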

On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy <minibit at gmail.com> wrote:

> Hi,
>
> The * next to the idle status in sinfo means that the node is
> unreachable/not responding. Check the status of the slurmd on the node and
> check the connectivity from the slurmctld host to the compute node (telnet
> may be enough). You can also check the slurmctld logs for more information.
>
> Regards,
> Carlos
>
> On Mon, 20 Jan 2020 at 21:04, Dean Schulze <dean.w.schulze at gmail.com>
> wrote:
>
>> I've got a node running CentOS 7.7, built from the recent 20.02.0pre1
>> code base.  Its behavior is strange, to say the least.
>>
>> The controller was built from the same code base, but on Ubuntu 19.10.
>> The controller reports the node's state with sinfo, but can't run a simple
>> job with srun because it thinks the node isn't available, even when it is
>> idle.  (And squeue shows an empty queue.)
>>
>> On the controller:
>> $ srun -N 1 hostname
>> srun: Required node not available (down, drained or reserved)
>> srun: job 30 queued and waiting for resources
>> ^Csrun: Job allocation 30 has been revoked
>> srun: Force Terminated job 30
>> $ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*       up   infinite      1  idle* liqidos-dean-node1
>> $ squeue
>>              JOBID  PARTITION      USER  ST        TIME   NODES
>> NODELIST(REASON)
>>
>>
>> When I try to run the simple job on the node I get:
>>
>> [liqid at liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*       up   infinite      1  idle* liqidos-dean-node1
>> [liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
>> srun: Required node not available (down, drained or reserved)
>> srun: job 27 queued and waiting for resources
>> ^Csrun: Job allocation 27 has been revoked
>> [liqid at liqidos-dean-node1 ~]$ squeue
>>              JOBID  PARTITION      USER  ST        TIME   NODES
>> NODELIST(REASON)
>> [liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
>> srun: Required node not available (down, drained or reserved)
>> srun: job 28 queued and waiting for resources
>> ^Csrun: Job allocation 28 has been revoked
>> [liqid at liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*       up   infinite      1  idle* liqidos-dean-node1
>>
>> Apparently Slurm thinks there are a bunch of jobs queued, but shows an
>> empty queue.  How do I get rid of these?
>>
>> If these zombie jobs aren't the problem what else could be keeping this
>> from running?
>>
>> Thanks.
>>
> --
> Carles Fenoy
>
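[Editorial note: the empty squeue output suggests there are no zombie jobs; each srun above was cancelled with Ctrl-C, so the rising job IDs (27, 28, 30) are just successive submissions. The blocker is the `idle*` state itself. Once slurmd is reachable again, the node can be returned to service; a hedged sketch, with the node name taken from the sinfo output above:]

```shell
# On the controller, as the Slurm admin user: clear a node stuck in a
# down/drained/not-responding state once slurmd is confirmed healthy:
scontrol update NodeName=liqidos-dean-node1 State=RESUME

# Verify the '*' suffix is gone and the node is plain 'idle':
sinfo -n liqidos-dean-node1

# If any stray jobs did remain in the queue, they could be removed with:
# scancel <jobid>
```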

