[slurm-users] Can't get node out of drain state

Dean Schulze dean.w.schulze at gmail.com
Fri Jan 24 03:09:31 UTC 2020


The problem turned out to be that I had Gres=gpu:gp100:1 on the NodeName
line for that node, which has neither a GPU nor a gres.conf.  Once I moved
the Gres= setting to the correct NodeName line in slurm.conf, the node came
out of the drain state and became usable again.

Pretty strange that having a Gres= property on a node that doesn't have a
GPU would get it stuck in the drain state.
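
For anyone hitting the same thing, here is a minimal sketch of a check that
lists which NodeName lines in slurm.conf carry a Gres= setting, so each one
can be compared by hand against the hardware and gres.conf on that node.
The sample config, node names, and the helper name nodes_with_gres are
made up for illustration; the sketch does not handle backslash line
continuations and returns hostlist expressions such as node[01-04] as-is.

```python
# Hypothetical slurm.conf excerpt; node names and CPU counts are invented.
SAMPLE_SLURM_CONF = """\
# compute nodes
NodeName=node01 CPUs=8 Gres=gpu:gp100:1 State=UNKNOWN
NodeName=node02 CPUs=8 State=UNKNOWN
"""


def nodes_with_gres(conf_text):
    """Return {NodeName value: Gres value} for NodeName lines that set Gres=."""
    found = {}
    for raw in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # slurm.conf keywords are case-insensitive.
        if not line.lower().startswith("nodename="):
            continue
        # Split "Key=Value" tokens; keep the value untouched.
        fields = {}
        for token in line.split():
            if "=" in token:
                key, value = token.split("=", 1)
                fields[key.lower()] = value
        if "gres" in fields:
            found[fields["nodename"]] = fields["gres"]
    return found


if __name__ == "__main__":
    for name, gres in nodes_with_gres(SAMPLE_SLURM_CONF).items():
        print(f"{name}: Gres={gres} (confirm the device and gres.conf exist)")
```

To run it against a real cluster, read the actual slurm.conf into a string
and pass that in place of SAMPLE_SLURM_CONF.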



On Thu, Jan 23, 2020 at 2:34 PM Alex Chekholko <alex at calicolabs.com> wrote:

> Hey Dean,
>
> Does 'scontrol show node <nodename>' give any "Reason:"?  You can also look
> at 'sinfo -R'.
>
> Make sure the relevant network ports are open:
>
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>
> Also check that slurmd daemons on the compute nodes can talk to each other
> (not just to the master). e.g. bottom of
> https://slurm.schedmd.com/big_sys.html
>
> Regards,
> Alex
>
> On Thu, Jan 23, 2020 at 1:05 PM Dean Schulze <dean.w.schulze at gmail.com>
> wrote:
>
>> I've tried the normal things with scontrol (
>> https://blog.redbranch.net/2015/12/26/resetting-drained-slurm-node/),
>> but I have a node that will not come out of the drain state.
>>
>> I've also done a hard reboot and tried again.  Are there any other
>> remedies?
>>
>> Thanks.
>>
>