[slurm-users] 2 nodes being randomly set to "not responding"
jose at fzu.cz
Wed Jul 21 20:56:14 UTC 2021
Hi, you most likely want to set it the exact opposite way, as the Slurm cloud scheduling guide says:
"TreeWidth Since the slurmd daemons are not aware of the network addresses of other nodes in the cloud, the slurmd daemons on each node should be sent messages directly and not forward those messages between each other. To do so, configure TreeWidth to a number at least as large as the maximum node count. The value may not exceed 65533."
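As a rough sketch of what the guide's advice would look like in slurm.conf (untested on your setup; 65533 is simply the documented maximum, and any value at least as large as your node count should have the same effect):

```
# slurm.conf (fragment)
# Have slurmctld send messages to every slurmd directly instead of
# forwarding them through other slurmd daemons.
# Assumption: the cluster has no more than 65533 nodes.
TreeWidth=65533
```

The same slurm.conf needs to be present on all nodes. I would assume that restarting slurmctld and the slurmd daemons is the safe way to make sure the new value is picked up.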
From: Russell Jones <arjones85 at gmail.com>
Sent: Wednesday, 21 July 2021 22:30
To: Slurm User Community List
Subject: [slurm-users] 2 nodes being randomly set to "not responding"
We have a single Slurm cluster with multiple different architectures and compute clusters talking to a single slurmctld. This slurmctld is dual-homed on two different networks. Two individual nodes are by themselves on "network 2", while all of the other nodes are on "network 1". These two nodes stay online for a short period of time, but are then marked as down and not responding by slurmctld. 10 to 20 minutes later they are back online; rinse and repeat. There are absolutely no firewalls involved anywhere in the network.
I found a mailing list post from 2018 in which someone mentioned that all slurmd daemons expect to be able to talk to each other, and that you can get this flapping behavior when some nodes are segmented off from others. He suggested setting TreeWidth to 1 to force the slurmd daemons to communicate only with slurmctld directly. I gave that a shot, and unfortunately it seemed to make all of the other nodes unreachable! :-)
Is there a way to configure our setup so that we can have a proper dual-homed slurmctld without requiring every node to be reachable by every other node?