Hi everybody,
I am trying to use dynamic mode together with configless mode for slurmd, with Slurm 24.11.3 and after upgrading to 24.11.4, and I found a problem.
slurmctld runs in a Docker container, and my node is outside the container network.
slurmctld registers my node's IP using a function getpeeraddr on the slurmctld socket.
But the connection to that socket arrives through the Docker NAT/bridge, so slurmctld registers the bridged address instead of my real IP, that is to say the Docker gateway (172.20.0.1):
scontrol show node
-------------------------
NodeName=ltlsbubble1 Arch=x86_64 CoresPerSocket=4
..
NodeAddr=172.20.0.1 NodeHostName=ltlsbubble1 Version=24.11.4
So the node goes down after the "not pinging it" timeout, because slurmctld tries to reach it at the gateway address.
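To illustrate what I believe is happening, here is a minimal C sketch. It is not Slurm code: the helper name peer_addr_seen_by_server is hypothetical, and it uses the standard getpeername() call over a loopback connection only to show that a server learns the source address of the TCP connection, not the address the client thinks it has. Behind a NAT or bridge, that source address is the gateway.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Slurm): open a loopback TCP connection,
 * accept it, and write into buf the peer address as seen by the server side.
 * Returns 0 on success, -1 on error.
 */
int peer_addr_seen_by_server(char *buf, size_t len)
{
    struct sockaddr_in addr, peer;
    socklen_t alen = sizeof(addr), plen = sizeof(peer);
    int lfd = -1, cfd = -1, sfd = -1, rc = -1;

    assert(buf != NULL && len >= INET_ADDRSTRLEN);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    /* "controller" side: listen on loopback and learn the chosen port */
    if ((lfd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
        goto done;
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0)
        goto done;

    /* "node" side: connect to the controller */
    if ((cfd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
        goto done;
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;
    if ((sfd = accept(lfd, NULL, NULL)) < 0)
        goto done;

    /* the server only sees the source address of the connection */
    if (getpeername(sfd, (struct sockaddr *)&peer, &plen) < 0)
        goto done;
    if (inet_ntop(AF_INET, &peer.sin_addr, buf, (socklen_t)len) == NULL)
        goto done;
    rc = 0;

done:
    if (sfd >= 0)
        close(sfd);
    if (cfd >= 0)
        close(cfd);
    if (lfd >= 0)
        close(lfd);
    return rc;
}
```

On a single machine the buffer ends up holding the loopback address. In my setup the equivalent lookup inside the container would return 172.20.0.1, because that is where the packets appear to come from after the Docker bridge.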
I tried to update the config:
scontrol update NodeName=ltlsbubble1 NodeAddr=xx.xx.xx.xx
but at the first
scontrol reconfigure
it comes back to:
NodeAddr=172.20.0.1
in normal mode
-------------------
scontrol show node
NodeName=ltlsbubble1 Arch=x86_64 CoresPerSocket=4
..
NodeAddr=ltlsbubble1 NodeHostName=ltlsbubble1 Version=24.11.4
In normal mode NodeAddr is the same as NodeName, so slurmctld uses DNS resolution to communicate with the node.
To verify my hypothesis, I went into the C code of Slurm, identified the registration function, and replaced the call with the same mechanism used for a normal node.
In src/slurmctld/node_mgr.c
I replaced:
set_node_comm_name(node_ptr,
                   comm_name,
                   reg_msg->hostname);
by
set_node_comm_name(node_ptr,
                   NULL,
                   reg_msg->hostname);
I rebuilt slurmctld with this patch and tried it in dynamic mode; it works as expected:
scontrol show node
NodeName=ltlsbubble1 Arch=x86_64 CoresPerSocket=4
..
NodeAddr=ltlsbubble1 NodeHostName=ltlsbubble1 Version=24.1
There is no IP in NodeAddr, only the node name, so slurmctld uses DNS resolution. The node works fine and no longer goes down on the ping timeout.
So my question:
Could we have an option to force DNS resolution instead of IP discovery in dynamic mode?
(I tried the cloud_dns option, but that does not seem to be its purpose.)
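To make the request concrete, here is a small C sketch of the behavior I have in mind. Everything in it is an assumption for illustration: pick_comm_name and the force_dns flag are hypothetical names, not existing Slurm functions or options, and this is not how set_node_comm_name() is actually implemented. It only shows the choice between the discovered peer IP and the host name reported by the node.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not Slurm code): decide which name the controller
 * should use to contact a dynamically registered node.
 *
 *   peer_ip   - address discovered from the registration socket, may be NULL
 *   hostname  - host name reported by the node, must not be NULL
 *   force_dns - hypothetical option: non-zero means "ignore the peer IP"
 */
const char *pick_comm_name(const char *peer_ip, const char *hostname,
                           int force_dns)
{
    assert(hostname != NULL);

    /* With the option set, always return the host name so DNS is used. */
    if (force_dns)
        return hostname;

    /* Otherwise keep the discovered address when there is one. */
    if (peer_ip != NULL && peer_ip[0] != '\0')
        return peer_ip;

    /* No usable address: fall back to the host name. */
    return hostname;
}
```

With force_dns set, a node registering through the Docker bridge would be recorded as ltlsbubble1 rather than 172.20.0.1, which matches what my patch does by passing NULL.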
Best regards,
Stephane