We're experimenting with ways to manage our new racks of Lenovo SD665 V3 dual-server trays with Direct Water Cooling (further information is on our Wiki page https://wiki.fysik.dtu.dk/ITwiki/Lenovo_SD665_V3/ )
Management problems arise because two servers share a tray with common power and water cooling. This wouldn't be so bad if it weren't for Lenovo's NVIDIA/Mellanox SharedIO Infiniband adapters, where the left-hand node's IB adapter is a client of the right-hand node's adapter. As a result, we can't reboot or power down the right-hand node without killing any MPI jobs that happen to be running on the left-hand node.
My question is whether other Slurm sites owning Lenovo dual-server trays with SharedIO Infiniband adapters have developed some clever ways of handling such node pairs as a single entity somehow? Is there anything we should configure on the Slurm side to make such nodes easier to manage?
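For context, one rough idea we have considered (only a sketch, not something we have tested — the node and feature names below are made up for illustration) is to tag both members of each tray with a common feature in slurm.conf, so that tooling can find the partner node, and then to drain both nodes together before servicing the right-hand one:

```
# slurm.conf — hypothetical node names; "pair01" is an invented feature
# marking the two nodes that share a SharedIO tray
NodeName=sd665-a01 Features=sharedio,pair01   # left-hand node (IB client)
NodeName=sd665-a02 Features=sharedio,pair01   # right-hand node (IB host)

# Before rebooting the right-hand node, drain both members of the pair
# so no new MPI jobs start on the left-hand node either:
#   scontrol update NodeName=sd665-a01,sd665-a02 State=DRAIN Reason="SharedIO tray maintenance"
```

This still doesn't make Slurm treat the pair as a single schedulable entity, of course, which is why we'd like to hear how other sites handle it.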
Thanks for sharing any insights, Ole