[slurm-users] PMIx + openMPI with heterogeneous jobs

Bertini, Denis Dr. D.Bertini at gsi.de
Wed May 24 07:18:07 UTC 2023


I am facing the same problem that was reported some time ago (2019) on this mailing list:


https://lists.schedmd.com/pipermail/slurm-users/2019-July/003785.html


but with a more recent version of Slurm, i.e.:


slurm 21.08.8-2
PMIx 2.2.5 (pmix-2.2.5-1.el8.src.rpm)
openMPI 4.1.5

As in that earlier report, running heterogeneous MPI jobs (OSU benchmarks) with this
Slurm + PMIx version installed on the host sporadically gives this type of error:

>>>
slurmstepd: error:  mpi/pmix_v2: _tcp_connect: lxbk1177 [0]: pmixp_dconn_tcp.c:139: Cannot establish the connection
slurmstepd: error:  mpi/pmix_v2: pmixp_dconn_connect: lxbk1177 [0]: pmixp_dconn.h:246: Cannot establish direct connection to lxbk1177 (0)
slurmstepd: error:  mpi/pmix_v2: _process_extended_hdr: lxbk1177 [0]: pmixp_server.c:738: Unable to connect to 0
slurmstepd: error:  mpi/pmix_v2: pmixp_coll_ring_check: lxbk1177 [0]: pmixp_coll_ring.c:618: 0x14cd84047ab0: unexpected contrib from lxbk1177:0, expected is 1
slurmstepd: error:  mpi/pmix_v2: _process_server_request: lxbk1177 [0]: pmixp_server.c:942: 0x14cd84047ab0: unexpected contrib from lxbk1177:0, coll->seq=0, seq=0
>>>

So it is indeed a very similar problem.
Additionally, when a job completes, from time to time it does not finish properly and stays in the RUNNING state, and one needs to
cancel the job manually.
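
For reference, the manual cleanup could be scripted roughly as follows. This is only a sketch: the helper name any_component_running is hypothetical, and it assumes the job state is read with squeue -h -j <jobid> -o "%i %T", which prints one "jobid state" line per hetjob component.

```shell
#!/bin/sh
# Hypothetical helper: reads "jobid state" lines on stdin (for example the
# output of: squeue -h -j <jobid> -o "%i %T") and succeeds if at least one
# component is still reported as RUNNING.
any_component_running() {
    awk '$2 == "RUNNING" { found = 1 } END { exit !found }'
}

# Example use (assumption: the application itself has already finished,
# so a RUNNING state means the job is stuck):
# squeue -h -j "$jobid" -o "%i %T" | any_component_running && scancel "$jobid"
```

Cancelling by the job ID of the hetjob leader should remove all components, so a single scancel call is expected to be enough.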

Does the hetjob functionality really support this case?
If so, any ideas what could be wrong here?



Job submission details:
==================


- submit script:

sbatch --ntasks 1 --ntasks-per-core 1 --cpus-per-task 2   -p main  -D ./data -o %j.out.log -e %j.err.log : --ntasks 1 --ntasks-per-core 1 --cpus-per-task 1  -p main  -D ./data -o %j.out.log -e %j.err.log  ./run-file.sh



- run-file.sh:



export CONT=<std_singularity_container>.sif

srun  -vv --mpi=pmix --export=ALL : $CONT collective/osu_allreduce -f -i 100 -x 10
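
Since the failure is sporadic, it can help to count how many runs were affected. The following is a minimal sketch, assuming the error logs are written as <jobid>.err.log into ./data (as requested by the -D and -e options of the submit command); the function name pmix_failed_logs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the error logs that contain the PMIx
# direct-connection failure quoted in the error output above.
# Assumption: all logs sit in one directory and are named <jobid>.err.log.
pmix_failed_logs() {
    log_dir="${1:-./data}"
    # -l prints only the names of matching files; -F matches the text literally.
    grep -lF "Cannot establish the connection" "$log_dir"/*.err.log 2>/dev/null
}

# Example: number of runs that hit the error
# pmix_failed_logs ./data | wc -l
```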




---------
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a

Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bertini at gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
Ministerialdirigent Dr. Volkmar Dietz