We built our stack using helmod, which is an extension of Lmod that uses RPM spec files. Our spec for OpenMPI can be found here: https://github.com/fasrc/helmod/blob/master/rpmbuild/SPECS/rocky8/openmpi-5....
I've tested with both Intel and GCC and have seen no issues (we use ReFrame for our testing: https://github.com/fasrc/reframe-fasrc).
-Paul Edmon-
On 8/26/2024 3:28 PM, Ole Holm Nielsen via slurm-users wrote:
On 26-08-2024 20:30, Paul Edmon via slurm-users wrote:
I haven't seen any behavior like that. For reference we are running Rocky 8.9 with MOFED 23.10.2
That's interesting! Our nodes run Rocky 8.10 with the Mellanox driver tarball MLNX_OFED_LINUX-24.04-0.7.0.0-rhel8.9-x86_64.tgz installed. That's close to your setup! User applications may use any MPI package, but very likely OpenMPI/4.1.5-GCC-12.3.0 from the latest EasyBuild software modules.
It seems that we need to make some more careful testing of multi-node MPI jobs while taking SD665 V3 nodes down.
I wonder if there's any additional OpenMPI or Slurm configuration in your setup, such as building Slurm with PMIx support (--with-pmix)?
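For reference, PMIx support is enabled when Slurm is configured from source. A minimal sketch, assuming PMIx is installed under /usr and Slurm goes under /opt/slurm (both paths are site-specific assumptions, not from the thread):

```shell
# Sketch: build Slurm with PMIx support (paths are assumptions).
./configure --prefix=/opt/slurm --with-pmix=/usr
make -j"$(nproc)" && make install

# Afterwards, list the MPI plugin types Slurm was built with;
# "pmix" should appear in the output if the build picked it up.
srun --mpi=list
```

With that in place, OpenMPI jobs can be launched via `srun --mpi=pmix` instead of `mpirun`.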
Thanks, Ole
On 8/26/2024 2:23 PM, Ole Holm Nielsen via slurm-users wrote:
Hi Paul,
On 26-08-2024 15:29, Paul Edmon via slurm-users wrote:
We've had this exact hardware for years now (all the CPU trays from Lenovo have been dual trays for the past few generations, though previously they used a Y-cable to connect both). Basically, the way we handle it is to drain a node's partner whenever one goes down for a hardware issue.
The SD665 V3 system was announced in Nov. 2022. This V3 generation seems to come with a single IB cable per tray with 2 nodes. In retrospect, I would wish for independent IB adapters in each node and an IB splitter cable (Y-cable) with 200 Gb to 2x100 Gb transceivers.
I agree that we can drain partner nodes in Slurm when servicing a node.
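Draining the partner node is a pair of scontrol commands; a sketch with hypothetical node names (nodeA is the node being serviced, nodeB its tray partner):

```shell
# Drain the tray partner before servicing nodeA: no new jobs start
# on nodeB, and jobs already running there are allowed to finish.
scontrol update NodeName=nodeB State=DRAIN Reason="tray partner nodeA in service"

# When the tray is back in production, return the partner to service:
scontrol update NodeName=nodeB State=RESUME
```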
That said, you are free to reboot either node without loss of connectivity. We do that all the time with no issues. As noted, though, if you want to physically service the nodes, then you have to take out both.
What we have experienced several times is that multi-node MPI jobs, running on the left-hand SD665 V3 node plus other nodes in the cluster, crash when the right-hand node is rebooted for a kernel update or whatever reason. The right-hand node of course houses the physical SharedIO Infiniband adapter.
My interpretation is that the IB adapter gets reset when the right-hand node reboots, disrupting IB traffic to the left-hand node for a while and causing job crashes.
Have you seen any behavior like this?
Thanks, Ole
On 8/26/2024 8:51 AM, Ole Holm Nielsen via slurm-users wrote:
We're experimenting with ways to manage our new racks of Lenovo SD665 V3 dual-server trays with Direct Water Cooling (further information is on our Wiki page https://wiki.fysik.dtu.dk/ITwiki/Lenovo_SD665_V3/).
Management problems arise because 2 servers share a tray with common power and water cooling. This wouldn't be so bad if it weren't for Lenovo's NVIDIA/Mellanox SharedIO Infiniband adapters, where the left-hand node's IB adapter is a client of the right-hand node's adapter. So we can't reboot or power down the right-hand node without killing any MPI jobs that happen to be using the left-hand node.
My question is whether other Slurm sites owning Lenovo dual-server trays with SharedIO Infiniband adapters have developed some clever ways of handling such node pairs as a single entity? Is there anything we should configure on the Slurm side to make such nodes easier to manage?
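One possible (untested) idea on the Slurm side: tag each tray's node pair with a shared feature in slurm.conf, so maintenance tooling can look up a node's SharedIO partner before draining or rebooting. A sketch with entirely hypothetical node and feature names:

```
# slurm.conf fragment (hypothetical names): give each dual-node tray
# a shared feature so scripts can identify the SharedIO partner.
NodeName=node[001-002] Features=tray01,sharedio
NodeName=node[003-004] Features=tray02,sharedio
```

A maintenance script could then filter `sinfo -N -h -o "%N %f"` on the tray feature to find the partner and drain both nodes together.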
Thanks for sharing any insights, Ole