[slurm-users] slurmctld removing offline nodes

Joe Teumer joe.teumer at gmail.com
Tue Oct 25 20:29:58 UTC 2022


Yes, we are using dynamic DNS, so an offline node's name no longer resolves.
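
For anyone hitting the same thing, a quick way to confirm it from the controller host is to check which node names currently resolve. The following is only a minimal sketch, not anything shipped with Slurm: the function name `unresolved_nodes` and its `resolver` parameter are made up for illustration. It relies on Python's `socket.getaddrinfo`, which wraps the same getaddrinfo() lookup that the slurmctld log quoted below reports as failing.

```python
import socket
import sys


def unresolved_nodes(names, resolver=socket.getaddrinfo):
    """Return the node names that the resolver cannot look up.

    `resolver` is a hypothetical hook (defaulting to socket.getaddrinfo)
    so the check can be exercised without touching real DNS.
    """
    failed = []
    for name in names:
        try:
            # Port is None: only name resolution matters here.
            resolver(name, None)
        except socket.gaierror:
            # Raised for lookup errors such as
            # "Temporary failure in name resolution".
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Node names are passed on the command line, e.g. spg-ethx-f4ce.
    for name in unresolved_nodes(sys.argv[1:]):
        print(f"{name}: does not resolve")
```

Run it with the NodeName values from slurm.conf as arguments; any name it prints is one the controller would presumably fail to resolve as well, assuming both use the same resolver configuration.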

On Tue, Oct 25, 2022 at 2:17 PM Meaden, Xand <xand.meaden at kcl.ac.uk> wrote:

> The nodes are being removed as they aren't resolving in DNS anymore; are
> you using a dynamic system where only active hosts' names resolve?
>
> Xand
>
> ------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Joe Teumer <joe.teumer at gmail.com>
> *Sent:* Tuesday, October 25, 2022 7:42:16 PM
> *To:* slurm-users at schedmd.com <slurm-users at schedmd.com>
> *Subject:* [slurm-users] slurmctld removing offline nodes
>
> We noticed that the slurm controller will remove nodes that it cannot
> reach.
> How can this be disabled?
> We would like to see the nodes marked down/drain instead of the controller
> removing the nodes from sinfo.
>
> /var/log/slurm/slurmctld.log
> [2022-10-25T13:10:01.500] debug:  Log file re-opened
> [2022-10-25T13:10:01.589] error: get_addr_info: getaddrinfo() failed:
> Temporary failure in name resolution
> [2022-10-25T13:10:01.589] error: slurm_set_addr: Unable to resolve
> "spg-ethx-f4ce"
> [2022-10-25T13:10:01.589] error: slurm_get_port: Address family '0' not
> supported
> [2022-10-25T13:10:01.589] error: _set_slurmd_addr: failure on spg-ethx-f4ce
>
> grep -i f4ce /etc/slurm/slurm.conf
> NodeName=spg-ethx-f4ce ...
> PartitionName=debug spg-ethx-f4ce ...
>
> No output in sinfo:
> sinfo -N | grep f4ce
> sinfo -R | grep f4ce
>
> slurmd -V
> slurm 21.08.0
>