Hi All,
We're currently setting up SLURM on a RHEL 8.9-based cluster. Here's a summary of the steps we've taken so far:
- Installed MLNX OFED ConnectX-5.2.
- Compiled and installed PMIx and UCX.
- Compiled and installed Slurm with PMIx v4 and UCX support.
- Compiled OpenMPI with SLURM, PMIx, libevent, and hwloc support.
- All compute nodes are reachable via the IB network.
*Problem:* While hello world MPI jobs run fine across multiple nodes, they are not using InfiniBand.
srun --mpi=pmix -N2 -n2 --ntasks-per-node=2 ./hello > log.out 2>&1
Output from srun --mpi=list:
MPI plugin types are...
    none
    cray_shasta
    pmi2
    pmix
specific pmix plugin versions available: pmix_v4
Could someone please point me in the right direction on how to troubleshoot this issue?
Thank you for your assistance.
Sudhakar
How was your binary compiled? If it is dynamically linked, please reply with the ldd listing of the binary (ldd binary).
Jenny
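For anyone following along, here is a minimal sketch of how the ldd listing requested above could be summarized before posting it. The helper name check_ldd and the list of libraries it looks for (libmpi, libucp, libuct, libpmix) are assumptions made for illustration, not something from the thread. A missing library is not conclusive either: if OpenMPI was built with its components as separate plugins, UCX may be loaded at run time and never appear in the binary's ldd output.

```shell
# Hypothetical helper: read an ldd listing on stdin and report which
# MPI-related shared libraries it mentions. The library names below are
# assumptions chosen for this sketch.
check_ldd() {
    listing=$(cat)
    for lib in libmpi libucp libuct libpmix; do
        if printf '%s\n' "$listing" | grep -qF "${lib}.so"; then
            echo "$lib: found"
        else
            echo "$lib: not found"
        fi
    done
}

# Usage on a compute node (the ./hello path is taken from the srun line above):
#   ldd ./hello | check_ldd
```

If libucp shows up as "not found", running `ompi_info | grep -i ucx` on the same node is one way to check whether the OpenMPI build includes UCX components at all.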
From: S L via slurm-users <slurm-users@lists.schedmd.com>
Sent: Tuesday, February 20, 2024 10:55 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] RHEL 8.9+SLURM-23.11.3+MLNX_OFED_LINUX-23.10-1.1.9.0+ OpenMPI-5.0.2