[slurm-users] slurmctld removing offline nodes
xand.meaden at kcl.ac.uk
Tue Oct 25 19:15:25 UTC 2022
The nodes are being removed because their hostnames no longer resolve in DNS. Are you using a dynamic setup where only active hosts' names resolve?
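A quick way to confirm this from the controller host is to ask the system resolver for the node name directly. The following is only a rough sketch: `check_node` is a hypothetical helper name, and `getent hosts` consults /etc/hosts and DNS via NSS, which should be close to, but is not guaranteed identical to, the lookup slurmctld performs.

```shell
#!/bin/sh
# Hypothetical helper: report whether a node name resolves through the
# system resolver (NSS: /etc/hosts, DNS, ...) on the machine running it.
check_node() {
    if getent hosts "$1" >/dev/null 2>&1; then
        echo "OK $1"
    else
        echo "UNRESOLVED $1"
    fi
}

# Node name taken from the log excerpt in this thread.
check_node spg-ethx-f4ce
```

If this reports UNRESOLVED while the node is powered off, that would match the getaddrinfo() errors in the log. In that case, giving the node a fixed address that does not depend on it being up (for example an /etc/hosts entry on the controller, or the NodeAddr parameter on the node's definition in slurm.conf) may be worth trying. Treat that as a suggestion to test, not a confirmed fix for this setup.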
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Joe Teumer <joe.teumer at gmail.com>
Sent: Tuesday, October 25, 2022 7:42:16 PM
To: slurm-users at schedmd.com <slurm-users at schedmd.com>
Subject: [slurm-users] slurmctld removing offline nodes
We noticed that the slurm controller will remove nodes that it cannot reach.
How can this be disabled?
We would like to see the nodes marked down/drain instead of the controller removing the nodes from sinfo.
[2022-10-25T13:10:01.500] debug: Log file re-opened
[2022-10-25T13:10:01.589] error: get_addr_info: getaddrinfo() failed: Temporary failure in name resolution
[2022-10-25T13:10:01.589] error: slurm_set_addr: Unable to resolve "spg-ethx-f4ce"
[2022-10-25T13:10:01.589] error: slurm_get_port: Address family '0' not supported
[2022-10-25T13:10:01.589] error: _set_slurmd_addr: failure on spg-ethx-f4ce
The node is present in slurm.conf:
grep -i f4ce /etc/slurm/slurm.conf
PartitionName=debug spg-ethx-f4ce ...
No output in sinfo:
sinfo -N | grep f4ce
sinfo -R | grep f4ce