[slurm-users] Node can't run simple job when STATUS is up and STATE is idle

Dean Schulze dean.w.schulze at gmail.com
Mon Jan 20 23:00:34 UTC 2020


There's either a problem with the source code I cloned from GitHub, or a
problem when the controller runs on Ubuntu 19.10 and the node runs
on CentOS 7.7.  I'm downgrading to a stable 19.05 build to see if that
solves the problem.
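
Before downgrading it's probably worth confirming that both daemons really
are the same build (just a sanity check; each command only prints the
version the binary was built from):

On the controller (Ubuntu):   slurmctld -V
On the node (CentOS):         slurmd -V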

On Mon, Jan 20, 2020 at 3:41 PM Carlos Fenoy <minibit at gmail.com> wrote:

> It seems to me that the problem is between the slurmctld and slurmd. When
> slurmd starts it sends a message to the slurmctld; that's why the node
> appears idle. Every now and then the slurmctld will ping the slurmd to check
> that it's still alive. That ping doesn't seem to be working, so, as I
> mentioned previously, check the slurmctld log and the connectivity between
> the slurmctld node and the slurmd node.
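>
> Something along these lines should show it quickly (the log path below is
> only the common default; scontrol show config reports what your cluster
> actually uses):
>
> On the controller:     tail -f /var/log/slurmctld.log   # or wherever SlurmctldLogFile points
> On the compute node:   scontrol ping                    # confirms slurmctld is reachable from the node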
>
> On Mon, 20 Jan 2020, 22:43 Brian Andrus, <toomuchit at gmail.com> wrote:
>
>> Check the slurmd log file on the node.
>>
>> Ensure slurmd is still running. It's possible that the OOM killer or
>> something similar is killing slurmd.
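>>
>> For example (assuming a systemd-based install; unit names and log paths
>> can differ):
>>
>> systemctl status slurmd
>> journalctl -u slurmd --since today
>> dmesg | grep -iE 'out of memory|killed process'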
>>
>> Brian Andrus
>> On 1/20/2020 1:12 PM, Dean Schulze wrote:
>>
>> If I restart slurmd the asterisk goes away.  Then I can run the job once,
>> but the asterisk comes back and the node remains in comp*:
>>
>> [liqid at liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*       up   infinite      1   idle liqidos-dean-node1
>> [liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
>> liqidos-dean-node1
>> [liqid at liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*       up   infinite      1  comp* liqidos-dean-node1
>>
>> I can get it back to idle* with scontrol:
>>
>> [liqid at liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update
>> NodeName=liqidos-dean-node1 State=down Reason=none
>> [liqid at liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update
>> NodeName=liqidos-dean-node1 State=resume
>> [liqid at liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*       up   infinite      1  idle* liqidos-dean-node1
>>
>> I'm beginning to wonder if I got some bad code from GitHub.
>>
>>
>> On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy <minibit at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> The * next to the idle status in sinfo means that the node is
>>> unreachable/not responding. Check the status of the slurmd on the node and
>>> check the connectivity from the slurmctld host to the compute node (telnet
>>> may be enough). You can also check the slurmctld logs for more information.
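>>>
>>> For instance (6818 is the default SlurmdPort, so adjust if your slurm.conf
>>> sets a different one):
>>>
>>> telnet liqidos-dean-node1 6818          # from the slurmctld host
>>> scontrol show node liqidos-dean-node1   # shows the node's State/Reason as slurmctld sees it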
>>>
>>> Regards,
>>> Carlos
>>>
>>> On Mon, 20 Jan 2020 at 21:04, Dean Schulze <dean.w.schulze at gmail.com>
>>> wrote:
>>>
>>>> I've got a node running on CentOS 7.7, built from the recent 20.02.0pre1
>>>> code base.  Its behavior is strange, to say the least.
>>>>
>>>> The controller was built from the same code base, but on Ubuntu 19.10.
>>>> The controller reports the node's state with sinfo, but can't run a simple
>>>> job with srun because it thinks the node isn't available, even when it is
>>>> idle.  (And squeue shows an empty queue.)
>>>>
>>>> On the controller:
>>>> $ srun -N 1 hostname
>>>> srun: Required node not available (down, drained or reserved)
>>>> srun: job 30 queued and waiting for resources
>>>> ^Csrun: Job allocation 30 has been revoked
>>>> srun: Force Terminated job 30
>>>> $ sinfo
>>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>> debug*       up   infinite      1  idle* liqidos-dean-node1
>>>> $ squeue
>>>>              JOBID  PARTITION      USER  ST        TIME   NODES NODELIST(REASON)
>>>>
>>>>
>>>> When I try to run the simple job on the node I get:
>>>>
>>>> [liqid at liqidos-dean-node1 ~]$ sinfo
>>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>> debug*       up   infinite      1  idle* liqidos-dean-node1
>>>> [liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
>>>> srun: Required node not available (down, drained or reserved)
>>>> srun: job 27 queued and waiting for resources
>>>> ^Csrun: Job allocation 27 has been revoked
>>>> [liqid at liqidos-dean-node1 ~]$ squeue
>>>>              JOBID  PARTITION      USER  ST        TIME   NODES NODELIST(REASON)
>>>> [liqid at liqidos-dean-node1 ~]$ srun -N 1 hostname
>>>> srun: Required node not available (down, drained or reserved)
>>>> srun: job 28 queued and waiting for resources
>>>> ^Csrun: Job allocation 28 has been revoked
>>>> [liqid at liqidos-dean-node1 ~]$ sinfo
>>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>> debug*       up   infinite      1  idle* liqidos-dean-node1
>>>>
>>>> Apparently Slurm thinks there are a bunch of jobs queued, yet squeue shows
>>>> an empty queue.  How do I get rid of these?
>>>>
>>>> If these zombie jobs aren't the problem, what else could be keeping this
>>>> from running?
>>>>
>>>> Thanks.
>>>>
>>> --
>>> --
>>> Carles Fenoy
>>>
>>