[slurm-users] 2 nodes being randomly set to "not responding"

Thu Jul 22 17:44:05 UTC 2021

That appears to have fixed it. Thank you!
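
For anyone who finds this thread later: the rule in the guide quoted below (TreeWidth at least as large as the maximum node count, and no more than 65533) can be checked with a short script. This is only a sketch and not something taken from the Slurm tools. The function names find_tree_width and sends_directly, the sample config text, and the node count of 200 are all made up for illustration, and the parsing assumes a plain "TreeWidth=<number>" line with "#" comments.

```python
import re

# Upper bound quoted from the elastic computing guide.
MAX_TREE_WIDTH = 65533

def find_tree_width(conf_text):
    """Return the first TreeWidth value found in slurm.conf-style text, or None."""
    for line in conf_text.splitlines():
        # Drop anything after a '#' so commented-out settings are ignored.
        line = line.split("#", 1)[0]
        match = re.search(r"\bTreeWidth\s*=\s*(\d+)", line, re.IGNORECASE)
        if match:
            return int(match.group(1))
    return None

def sends_directly(conf_text, max_node_count):
    """True if TreeWidth is set, covers every node, and stays within the limit."""
    width = find_tree_width(conf_text)
    return width is not None and max_node_count <= width <= MAX_TREE_WIDTH

# Hypothetical example: a cluster that never exceeds 200 nodes.
sample = "# illustrative fragment only\nTreeWidth=65533\n"
print(sends_directly(sample, 200))
print(sends_directly("TreeWidth=1\n", 200))
```

The first call checks a value that follows the guide's advice; the second checks the TreeWidth=1 setting that was tried originally, which does not satisfy the rule.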

On Wed, Jul 21, 2021 at 3:58 PM <jose at fzu.cz> wrote:

> Hi, you most likely want to set it the exact opposite way, as the Slurm
> cloud scheduling guide says:
>
> "*TreeWidth*: Since the slurmd daemons are not aware of the network
> addresses of other nodes in the cloud, the slurmd daemons on each node
> should be sent messages directly and not forward those messages between
> each other. To do so, configure TreeWidth to a number at least as large as
> the maximum node count. The value may not exceed 65533."
>
> source: https://slurm.schedmd.com/elastic_computing.html
>
> Cheers
>
> Josef
>
> ------------------------------
> *From:* Russell Jones <arjones85 at gmail.com>
> *Sent:* Wednesday, 21 July 2021 22:30
> *To:* Slurm User Community List
> *Subject:* [slurm-users] 2 nodes being randomly set to "not responding"
>
> Hi all,
>
> We have a single slurm cluster with multiple different architectures and
> compute clusters talking to a single slurmctld. This slurmctld is
> dual-homed on two different networks. We have two individual nodes that are
> by themselves on "network 2" while all of the other nodes are on "network
> 1".  They will stay online for a short period of time, but then be marked
> as down and not responding by slurmctld. 10 to 20 minutes later they will
> be back online, rinse and repeat. There are absolutely no firewalls
> involved anywhere in the network.
>
> I found a mailing list post from 2018 where someone mentioned that the
> slurmd daemons all expect to be able to talk to each other, and that when
> some nodes are segmented off from others you can get this flapping
> behavior. He suggested setting TreeWidth to 1 to force the slurmd daemons
> to communicate only with slurmctld directly. I gave that a shot and it
> unfortunately seemed to make all of the other nodes unreachable!
> :-)
>
> Is there a way of properly configuring our setup so that we can have a
> proper dual-homed slurmctld and not require every node be reachable by
> every other node?
>
>
>