[slurm-users] slurmctld removing offline nodes

Meaden, Xand xand.meaden at kcl.ac.uk
Tue Oct 25 19:15:25 UTC 2022


The nodes are being removed because their hostnames no longer resolve in DNS (note the getaddrinfo() failures in your slurmctld.log). Are you using a dynamic system where only active hosts' names resolve?
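
One way to confirm this from the controller host is to try resolving each node name yourself. The following is only a rough sketch in Python, not anything Slurm ships: the function name unresolved_nodes and its resolver parameter are made up for illustration, and it assumes the lookup is run on the same machine (and with the same resolver configuration) as slurmctld.

```python
import socket


def unresolved_nodes(node_names, resolver=socket.getaddrinfo):
    """Return the node names that fail name resolution.

    `resolver` is a hypothetical hook (defaulting to socket.getaddrinfo)
    so the check can be exercised without touching real DNS.
    """
    failed = []
    for name in node_names:
        try:
            # The port is irrelevant here; None requests an address lookup only.
            resolver(name, None)
        except socket.gaierror:
            # Same class of failure as the getaddrinfo() error in the log.
            failed.append(name)
    return failed
```

Calling unresolved_nodes(["spg-ethx-f4ce"]) on the controller would return that name in the list if it cannot be resolved there, and an empty list otherwise.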

Xand

________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Joe Teumer <joe.teumer at gmail.com>
Sent: Tuesday, October 25, 2022 7:42:16 PM
To: slurm-users at schedmd.com <slurm-users at schedmd.com>
Subject: [slurm-users] slurmctld removing offline nodes

We noticed that the Slurm controller (slurmctld) removes nodes that it cannot reach.
How can this be disabled?
We would like the nodes to be marked down/drain instead of being removed from the sinfo output.

/var/log/slurm/slurmctld.log
[2022-10-25T13:10:01.500] debug:  Log file re-opened
[2022-10-25T13:10:01.589] error: get_addr_info: getaddrinfo() failed: Temporary failure in name resolution
[2022-10-25T13:10:01.589] error: slurm_set_addr: Unable to resolve "spg-ethx-f4ce"
[2022-10-25T13:10:01.589] error: slurm_get_port: Address family '0' not supported
[2022-10-25T13:10:01.589] error: _set_slurmd_addr: failure on spg-ethx-f4ce
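
For anyone wanting to list every node affected this way, the log excerpt above can be scanned for the "Unable to resolve" errors. This is a minimal sketch in Python that assumes the message format is exactly as shown in the excerpt; the names RESOLVE_ERROR and nodes_failing_resolution are hypothetical and not part of Slurm.

```python
import re

# Pattern taken from the slurm_set_addr line in the excerpt above; other
# Slurm versions may word this message differently.
RESOLVE_ERROR = re.compile(r'error: slurm_set_addr: Unable to resolve "([^"]+)"')


def nodes_failing_resolution(log_lines):
    """Return unique node names from 'Unable to resolve' errors, in order seen."""
    seen = []
    for line in log_lines:
        match = RESOLVE_ERROR.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# The log lines quoted in this message, used as sample input.
SAMPLE_LOG = [
    '[2022-10-25T13:10:01.500] debug:  Log file re-opened',
    '[2022-10-25T13:10:01.589] error: get_addr_info: getaddrinfo() failed: '
    'Temporary failure in name resolution',
    '[2022-10-25T13:10:01.589] error: slurm_set_addr: Unable to resolve "spg-ethx-f4ce"',
    "[2022-10-25T13:10:01.589] error: slurm_get_port: Address family '0' not supported",
    '[2022-10-25T13:10:01.589] error: _set_slurmd_addr: failure on spg-ethx-f4ce',
]
```

To run it against the real file, pass an open file object, e.g. nodes_failing_resolution(open("/var/log/slurm/slurmctld.log")).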

cat /etc/slurm/slurm.conf | grep -i f4ce
NodeName=spg-ethx-f4ce ...
PartitionName=debug spg-ethx-f4ce ...

No output in sinfo:
sinfo -N | grep f4ce
sinfo -R | grep f4ce

slurmd -V
slurm 21.08.0