Hi everybody, I am trying to use dynamic nodes together with configless mode on Slurm 24.11.3 (since upgraded to 24.11.4),
and I have found a problem with slurmd registration. slurmctld runs in a Docker container, and my node is outside the container network.
slurmctld registers my node's IP using a getpeeraddr function on the slurmctld socket. But the address seen on that socket comes from the Docker NAT/bridge, so slurmctld registers the bridged IP (not my real IP), that is to say the Docker gateway (172.20.0.1).
*scontrol show node*
*-------------------------*
*NodeName=ltlsbubble1 Arch=x86_64 CoresPerSocket=4..NodeAddr=172.20.0.1 NodeHostName=ltlsbubble1 Version=24.11.4*
So the node goes down once the "not pinging it" timeout expires.
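To illustrate the mechanism, here is a minimal sketch, assuming the address lookup ultimately relies on the POSIX getpeername() call (I have not confirmed that this is exactly what slurmctld does). The helper names peer_ip_of and demo_peer_ip are hypothetical and are not Slurm code. The accepting side only sees the source address of the TCP connection, so behind a NAT or bridge it would see the gateway rather than the node's real address.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not Slurm code): write the peer's IPv4 address, as
 * seen from the accepting side of a TCP connection, into buf.
 * Returns 0 on success, -1 on error.
 */
int peer_ip_of(int conn_fd, char *buf, socklen_t buflen)
{
    struct sockaddr_in peer;
    socklen_t len = sizeof(peer);

    assert(buf != NULL);
    if (getpeername(conn_fd, (struct sockaddr *)&peer, &len) < 0)
        return -1;
    if (peer.sin_family != AF_INET)
        return -1;
    if (inet_ntop(AF_INET, &peer.sin_addr, buf, buflen) == NULL)
        return -1;
    return 0;
}

/*
 * Loopback demo: listen on 127.0.0.1, connect to the listener, and report
 * the client address that the accepting side sees.
 * Returns 0 on success, -1 on error.
 */
int demo_peer_ip(char *buf, socklen_t buflen)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int lfd = -1, cfd = -1, afd = -1, rc = -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        goto out;
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    if (listen(lfd, 1) < 0)
        goto out;
    /* Read back the port that was actually assigned. */
    if (getsockname(lfd, (struct sockaddr *)&addr, &len) < 0)
        goto out;

    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0)
        goto out;
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    afd = accept(lfd, NULL, NULL);
    if (afd < 0)
        goto out;

    rc = peer_ip_of(afd, buf, buflen);
out:
    if (afd >= 0)
        close(afd);
    if (cfd >= 0)
        close(cfd);
    if (lfd >= 0)
        close(lfd);
    return rc;
}
```

Calling demo_peer_ip() with a buffer of INET_ADDRSTRLEN bytes fills it with the client's address as the server side sees it. In my setup the equivalent lookup on the slurmctld side would return 172.20.0.1, because that is where the connection appears to come from.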
I tried to update the configuration with *scontrol update NodeName=ltlsbubble1 NodeAddr=xx.xx.xx.xx*
but at the first
*scontrol reconfigure*
it reverts to: *NodeAddr=172.20.0.1*
In normal mode:
*scontrol show node*
*-------------------------*
*NodeName=ltlsbubble1 Arch=x86_64 CoresPerSocket=4..NodeAddr=ltlsbubble1 NodeHostName=ltlsbubble1 Version=24.11.4*
In normal mode NodeAddr is the same as NodeName, so DNS resolution is used for communication.
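As an illustration of what "using DNS resolution" means here, below is a minimal sketch that resolves a name to its first IPv4 address with the standard getaddrinfo() call. The helper name resolve_ipv4 is hypothetical, and I am only assuming that Slurm does something comparable when NodeAddr holds a name instead of an IP.

```c
#define _POSIX_C_SOURCE 200112L

#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not Slurm code): resolve name to its first IPv4
 * address and write it in dotted form into buf.
 * Returns 0 on success, -1 if the name cannot be resolved.
 */
int resolve_ipv4(const char *name, char *buf, socklen_t buflen)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int rc = -1;

    assert(name != NULL && buf != NULL);

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;       /* IPv4 only, to keep the sketch short */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(name, NULL, &hints, &res) != 0)
        return -1;

    if (res != NULL) {
        struct sockaddr_in *sin = (struct sockaddr_in *)res->ai_addr;
        if (inet_ntop(AF_INET, &sin->sin_addr, buf, buflen) != NULL)
            rc = 0;
        freeaddrinfo(res);
    }
    return rc;
}
```

With NodeAddr=ltlsbubble1, a lookup of this kind would run on the slurmctld side each time an address is needed, so the result follows whatever DNS (or /etc/hosts) says inside the container, independently of the source address of the registration connection.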
To verify my hypothesis, I went into the Slurm C code, identified the registration function, and replaced the address handling with the same mechanism used for normal nodes.
In src/slurmctld/node_mgr.c
I replaced: set_node_comm_name(node_ptr, *comm_name*, reg_msg->hostname); with: set_node_comm_name(node_ptr, NULL, reg_msg->hostname);
I rebuilt slurmctld with this patch and tried it in dynamic mode; it works as expected:
*scontrol show node*
*NodeName=ltlsbubble1 Arch=x86_64 CoresPerSocket=4..NodeAddr=ltlsbubble1 NodeHostName=ltlsbubble1 Version=24.1*
NodeAddr now contains no IP, only the node name, so DNS resolution is used. The node works fine and no longer goes down on the ping timeout.
So my question: could we have an option to force DNS resolution instead of IP discovery in dynamic mode? (I tried the cloud_dns option, but that does not seem to be its purpose.)
Best regards, Stephane