Hi there,
I am using Slurm v24.11.0 together with Open MPI 5.0.7 built against OpenPMIx v5.0.6, and I am facing a systematic crash during the process wire-up phase when launching a standard MPI job (OSU benchmarks) on our new AMD compute nodes (AMD EPYC 9654, 192 physical cores + HT) running Rocky Linux 9.4.
The typical error reads:
slurmstepd: error: mpi/pmix_v5: pmixp_p2p_send: ccexe0094 [4]: pmixp_utils.c:469: send failed, rc=1001, exceeded the retry limit
slurmstepd: error: mpi/pmix_v5: _slurm_send: ccexe0094 [4]: pmixp_server.c:1581: Cannot send message to /var/spool/slurmd/stepd.slurm.pmix.656.0, size = 46979, hostlist: (null)
srun: error: Node failure on ccexe0091
After such an error the node moves to the DOWN state, as you can see. It looks like the slurmstepd PMIx server cannot use the local socket at /var/spool/slurmd/stepd.slurm.pmix.<job_id>.0 for inter-node communication.
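To see at a glance which hosts report the send failure and which nodes srun marks as failed, the log lines above can be summarised with a small script. This is only a minimal sketch: it assumes the log format is exactly that of the lines quoted above, and the function names `parse_pmix_errors` and `failed_nodes` are hypothetical helpers of my own, not part of Slurm or PMIx. The bracketed number after the host name is kept as an opaque `index` field, since its meaning is not stated in the log.

```python
import re

# Matches slurmstepd PMIx error records of the form quoted above, e.g.
#   slurmstepd: error: mpi/pmix_v5: pmixp_p2p_send: ccexe0094 [4]: pmixp_utils.c:469: send failed, ...
# The lazy "msg" group stops at the next record, so records that were
# pasted onto a single line are still split correctly.
PMIX_ERR_RE = re.compile(
    r"slurmstepd: error: (?P<plugin>mpi/pmix_v\d+): (?P<func>\w+): "
    r"(?P<host>\S+) \[(?P<index>\d+)\]: (?P<src>[\w.]+):(?P<line>\d+): "
    r"(?P<msg>.*?)(?=\s*(?:slurmstepd: |srun: |$))",
    re.MULTILINE,
)

# Matches the srun message that reports a node as failed.
NODE_FAILURE_RE = re.compile(r"srun: error: Node failure on (\S+)")


def parse_pmix_errors(text):
    """Return one dict per slurmstepd PMIx error record found in text."""
    return [m.groupdict() for m in PMIX_ERR_RE.finditer(text)]


def failed_nodes(text):
    """Return the node names that srun reported as failed, in order."""
    return NODE_FAILURE_RE.findall(text)
```

Fed with the collected job output or the slurmd logs of the nodes involved, this gives the reporting host, source location and message for each record, which makes it easier to compare failing and working runs.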
* On a single AMD node (same Slurm version, same cluster setup), wire-up works smoothly even at core saturation (192 cores used).
* On Intel nodes (Intel Xeon Gold 6248R, 48 cores), wire-up works without any problem, even across multiple nodes.
* When the problematic AMD nodes are set up as dynamic nodes (https://slurm.schedmd.com/dynamic_nodes.html), the wire-up phase across multiple nodes works perfectly, without any issue.
Has anybody experienced this kind of problem? Any idea what could be the reason for that?
Cheers,
Denis
---------
Denis Bertini
Department: CIT
Location: SB3 2.265a
Tel: +49 6159 71 2240 Fax: +49 6159 71 2986 E-Mail: d.bertini@gsi.de
GSI Helmholtzzentrum für Schwerionenforschung GmbH Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528 Managing Directors / Geschäftsführung: Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats: Ministerialdirigent Dr. Volkmar Dietz