Hi all,
We’re running a small Slurm dev cluster on Ubuntu and have been facing issues with MPI/PMIx since upgrading Slurm from 23.02.5 to 23.11.3.
The first job step to use MPI within a job fails roughly 80% of the time, but subsequent attempts to use MPI within the same job work fine. For the failing job step, we see this error after hitting the MPI timeout:
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_reset_if_to: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:741: 0x55a03f8d7a90: collective timeout seq=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_log: hpc-d-msh-01a02 [1]: pmixp_coll.c:286: Dumping collective state
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:759: 0x55a03f8d7a90: COLL_FENCE_RING state seq=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:762: my peerid: 1:hpc-d-msh-01a02
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:769: neighbor id: next 0:hpc-d-msh-01a01, prev 0:hpc-d-msh-01a01
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:778: Context ptr=0x55a03f8d7b08, #0, in-use=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:778: Context ptr=0x55a03f8d7b40, #1, in-use=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:778: Context ptr=0x55a03f8d7b78, #2, in-use=1
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:787: seq=0 contribs: loc=1/prev=1/fwd=0
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:791: neighbor contribs [2]:
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:824: done contrib: hpc-d-msh-01a01
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:826: wait contrib: -
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:828: status=PMIXP_COLL_RING_FINILIZE
slurmstepd: error: mpi/pmix_v4: pmixp_coll_ring_log: hpc-d-msh-01a02 [1]: pmixp_coll_ring.c:831: buf (offset/size): 0/33362
[hpc-d-msh-01a01.tds.hpc.barf1.com:47652] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[hpc-d-msh-01a01.tds.hpc.barf1.com:47652] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 31
[hpc-d-msh-01a01:47652] *** An error occurred in MPI_Send
[hpc-d-msh-01a01:47652] *** reported by process [683360612,0]
[hpc-d-msh-01a01:47652] *** on communicator MPI_COMM_WORLD
[hpc-d-msh-01a01:47652] *** MPI_ERR_OTHER: known error not in list
[hpc-d-msh-01a01:47652] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[hpc-d-msh-01a01:47652] *** and potentially your MPI job)
slurmstepd: error: *** STEP 1890.0 ON hpc-d-msh-01a01 CANCELLED AT 2024-02-14T16:14:52 ***
The OpenMPI and PMIx versions have not changed, and downgrading Slurm to 23.02.5 seems to resolve the issue. We’d appreciate any pointers anyone might have.
Thanks
Oli