We're in the process of installing some racks with Lenovo SD665 V3 [1] water-cooled servers. A Lenovo DW612S chassis contains 6 1U trays with 2 SD665 V3 servers mounted side-by-side in each tray.
Lenovo delivers the SD665 V3 servers with water-cooled NVIDIA InfiniBand "SharedIO" adapters [2]: one node in each tray is the Primary, which holds the physical PCIe adapter, and the other is the Auxiliary, which has only a cable connecting it to the Primary's adapter.
Obviously, servicing 2 "Siamese twin" Slurm nodes requires a bit of care and planning. What is worse, when the Primary node is rebooted or powered down, the Auxiliary node loses its InfiniBand connection and may experience a PCIe fault or an NMI, as documented in [3]. And when the nodes are powered up, the Primary must have completed POST before the Auxiliary is started. I also wonder how best to deal with power failures?
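For the power-up ordering, I've been sketching something like the script below, to be run from a management node. Everything here is an assumption for illustration: it presumes ipmitool access to the BMCs, a naming convention where Primary "sd665-NNa" pairs with Auxiliary "sd665-NNb", and it approximates "POST completed" by waiting until the Primary's OS answers ping (a real check would probably query the XCC/SMM instead):

```shell
#!/bin/bash
# Hypothetical sketch: power on a SharedIO pair in the required order
# (Primary first, Auxiliary only after the Primary is up).
# Node naming convention and credentials are assumptions.

# Map an Auxiliary hostname to its Primary (assumed "...b" -> "...a").
primary_of() {
    echo "${1%b}a"
}

power_on_pair() {
    local aux=$1 primary
    primary=$(primary_of "$aux")

    # Power on the Primary via its BMC.
    ipmitool -I lanplus -H "$primary" -U admin -P "$IPMI_PW" chassis power on

    # Crude wait for the Primary to finish POST and boot: poll until
    # its OS answers ping.  Adjust to taste (or poll the XCC instead).
    until ping -c1 -W2 "$primary" >/dev/null 2>&1; do
        sleep 10
    done

    # Only now power on the Auxiliary.
    ipmitool -I lanplus -H "$aux" -U admin -P "$IPMI_PW" chassis power on
}
```

The same ordering constraint would have to be encoded in whatever handles power-capping or recovery after a site-wide power failure.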
It seems that Slurm jobs running on an Auxiliary node are going to crash whenever its (possibly unrelated) Primary node goes down.
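To at least limit the blast radius, I've been considering a controller-side health-check along these lines, which drains an Auxiliary as soon as its Primary is unhealthy so that no new jobs land on a node whose IB link is about to vanish. The pairs-file format, node names and drain reason are all just assumptions in this sketch:

```shell
#!/bin/bash
# Hypothetical sketch (cron or NHC fragment on the slurmctld host):
# drain each Auxiliary node whose Primary is in an unusable state.

# Return success if a sinfo state string means the node is unusable.
is_unhealthy() {
    case "$1" in
        down*|drain*|fail*|power*) return 0 ;;
        *) return 1 ;;
    esac
}

# Pairs file: one "primary auxiliary" hostname pair per line, e.g.
#   sd665-01a sd665-01b
drain_orphaned_aux() {
    local pairs_file=$1 primary aux state
    while read -r primary aux; do
        state=$(sinfo -h -n "$primary" -o '%T')
        if is_unhealthy "$state"; then
            scontrol update NodeName="$aux" State=DRAIN \
                Reason="SharedIO primary $primary is $state"
        fi
    done < "$pairs_file"
}
```

This doesn't save jobs already running on the Auxiliary, of course; it only stops the scheduler from making things worse.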
This looks like a pretty bad system design on the part of Lenovo :-( The goal was apparently to save some money on IB adapters and to use fewer IB cables.
Question: Do any Slurm sites out there already have experience with Lenovo "Siamese twin" nodes with SharedIO IB? Have you developed operational strategies, for example treating node pairs as a single entity for job scheduling?
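For the pair-as-a-unit idea, I've been toying with something like the sketch below (node names are illustrative and none of this is tested here): give both members of a pair a common feature in slurm.conf, and/or declare each pair a leaf switch in topology.conf so the topology-aware scheduler prefers to keep allocations within pairs:

```
# slurm.conf (sketch): tag both members of each SharedIO pair with a
# common feature, so jobs and reservations can target whole pairs:
NodeName=sd665-01a,sd665-01b Features=sharedio,pair01
NodeName=sd665-02a,sd665-02b Features=sharedio,pair02

# topology.conf (sketch): make each pair a leaf switch:
SwitchName=pair01 Nodes=sd665-01a,sd665-01b
SwitchName=pair02 Nodes=sd665-02a,sd665-02b
SwitchName=root Switches=pair01,pair02
```

I'd be very interested to hear whether anyone has made something like this work in practice, or found a better approach.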
Thanks for sharing any ideas and insights!
Ole
[1] https://lenovopress.lenovo.com/lp1612-lenovo-thinksystem-sd665-v3-server
[2] https://lenovopress.lenovo.com/lp1693-thinksystem-nvidia-connectx-7-ndr200-i...
[3] https://support.lenovo.com/us/en/solutions/ht510888-thinksystem-sd650-and-co...