Hi everybody,
I am trying to use dynamic mode together with configless mode for slurmd,
on Slurm 24.11.3 and then on 24.11.4 after an upgrade, and I found a
problem.
slurmctld runs in a Docker container, and my node is outside the container
network.
slurmctld registers the node's IP using a getpeeraddr function on the
slurmctld socket.
But the connection reaches that socket through the Docker NAT/bridge, so
slurmctld registers the bridged IP rather than the node's real IP, that is
to say the Docker gateway (172.20.0.1):
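To make the mechanism concrete, here is a minimal sketch in plain POSIX C.
It is my own illustration, not Slurm code: the function name
peer_addr_seen_by_server is hypothetical, and I use loopback only so that it
runs anywhere. It shows that a server learns the client address from the
connection itself (via getpeername), not from anything the client says, which
is why a NAT in between changes the result.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical demo, not Slurm code: connect a client to a listening
 * socket on loopback and write the client address, as the server sees it
 * through getpeername(), into buf. Returns 0 on success, -1 on error.
 */
int peer_addr_seen_by_server(char *buf, size_t len)
{
	struct sockaddr_in addr, peer;
	socklen_t alen = sizeof(addr), plen = sizeof(peer);
	int lfd = -1, cfd = -1, sfd = -1, rc = -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0; /* let the kernel choose a free port */

	/* Server side: listen on 127.0.0.1 and find out which port we got. */
	if ((lfd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
		goto out;
	if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto out;
	if (listen(lfd, 1) < 0)
		goto out;
	if (getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0)
		goto out;

	/* Client side: connect, then let the server accept the connection. */
	if ((cfd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
		goto out;
	if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto out;
	if ((sfd = accept(lfd, NULL, NULL)) < 0)
		goto out;

	/* The server only knows the source address of the connection. */
	if (getpeername(sfd, (struct sockaddr *)&peer, &plen) < 0)
		goto out;
	if (!inet_ntop(AF_INET, &peer.sin_addr, buf, (socklen_t)len))
		goto out;
	rc = 0;
out:
	if (sfd >= 0)
		close(sfd);
	if (cfd >= 0)
		close(cfd);
	if (lfd >= 0)
		close(lfd);
	return rc;
}
```

In my setup the source address of the connection, as seen from inside the
container, is the Docker gateway, so that is what ends up in NodeAddr.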
*scontrol show node*
*-------------------------*
*NodeName=ltlsbubble1 Arch=x86_64 CoresPerSocket=4..NodeAddr=172.20.0.1
NodeHostName=ltlsbubble1 Version=24.11.4*
So slurmctld cannot ping the node, and the node goes down after the "not
pinging it" timeout.
I tried to update the configuration:
*scontrol update NodeName=ltlsbubble1 NodeAddr=xx.xx.xx.xx*
but at the first
*scontrol reconfigure*
it comes back to *NodeAddr=172.20.0.1*.
in normal mode
-------------------
*scontrol show node*
*NodeName=ltlsbubble1 Arch=x86_64 CoresPerSocket=4..NodeAddr=ltlsbubble1
NodeHostName=ltlsbubble1 Version=24.11.4*
In normal mode NodeAddr is the same as NodeName, so DNS resolution is used
for communication.
To verify my hypothesis, I went into the Slurm C code, identified the
registration function, and replaced the call with the same mechanism as for
a normal node.
In src/slurmctld/node_mgr.c I replaced:
set_node_comm_name(node_ptr, *comm_name*, reg_msg->hostname);
with
set_node_comm_name(node_ptr, NULL, reg_msg->hostname);
I rebuilt slurmctld with this patch and tried it in dynamic mode; it works
as expected:
*scontrol show node*
*NodeName=ltlsbubble1 Arch=x86_64 CoresPerSocket=4..NodeAddr=ltlsbubble1
NodeHostName=ltlsbubble1 Version=24.1*
There is no IP in NodeAddr, only the node name, so DNS resolution is used.
The node works fine and no longer goes down on the ping timeout.
So my question:
Can we have an option to force DNS resolution instead of IP discovery in
dynamic mode?
(I tried the cloud_dns option, but that does not seem to be its purpose.)
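To illustrate the kind of option I mean, here is a minimal sketch. It is not
Slurm code: pick_comm_name and the force_dns flag are hypothetical names I
made up for this mail. It only shows the selection logic: with the flag set,
the address read from the socket is ignored and the hostname from the
registration message is stored, which is what my patch effectively does by
passing NULL.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of Slurm: choose the address that the
 * controller stores for a dynamic node.
 *   peer_addr - address read from the registration socket (may be NULL)
 *   hostname  - hostname reported by the node in its registration message
 *   force_dns - non-zero to ignore peer_addr and rely on DNS resolution
 */
const char *pick_comm_name(const char *peer_addr, const char *hostname,
			   int force_dns)
{
	if (force_dns || peer_addr == NULL)
		return hostname;
	return peer_addr;
}
```

With force_dns set, pick_comm_name("172.20.0.1", "ltlsbubble1", 1) would give
"ltlsbubble1", so NodeAddr would hold a name that DNS can resolve from inside
the container.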
Best regards,
Stephane