We've had this exact hardware for years now (all of Lenovo's CPU trays have been dual trays for the past few generations, though previously they used a Y cable to connect both). Basically, the way we handle it is to drain the partner node whenever one of the pair goes down for a hardware issue.
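For reference, a minimal sketch of that drain workflow using scontrol/sinfo (the node names a01/a02 here are placeholders for the two halves of a tray; adjust to your own naming and Reason text):

    # Drain the healthy partner before servicing the failed node of a tray
    scontrol update NodeName=a01 State=DRAIN Reason="partner a02 down for hardware service"

    # Watch until running jobs finish and both halves report drained/down
    sinfo -n a01,a02 -o "%N %T %E"

    # Once the tray is back, return both nodes to service
    scontrol update NodeName=a01,a02 State=RESUME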
That said, you are free to reboot either node without loss of connectivity. We do that all the time with no issues. As noted, though, if you want to physically service the nodes, you have to take out both.
-Paul Edmon-
On 8/26/2024 8:51 AM, Ole Holm Nielsen via slurm-users wrote:
We're experimenting with ways to manage our new racks of Lenovo SD665 V3 dual-server trays with Direct Water Cooling (further information is on our Wiki page https://wiki.fysik.dtu.dk/ITwiki/Lenovo_SD665_V3/).
Management problems arise because 2 servers share a tray with common power and water cooling. This wouldn't be so bad if it weren't for Lenovo's NVIDIA/Mellanox SharedIO InfiniBand adapters, where the left-hand node's IB adapter is a client of the right-hand node's adapter. So we can't reboot or power down the right-hand node without killing any MPI jobs that happen to be using the left-hand node.
My question is whether other Slurm sites owning Lenovo dual-server trays with SharedIO InfiniBand adapters have developed some clever ways of handling such node pairs as a single entity somehow? Is there anything we should configure on the Slurm side to make such nodes easier to manage?
Thanks for sharing any insights, Ole