[slurm-users] PMix3 Plugin+ openMPI 4.1.5 broken for heterogenous jobs with SLURM v 21.08.8-2

Bertini, Denis Dr. D.Bertini at gsi.de
Tue Jun 20 06:37:13 UTC 2023


Hi

I have made some progress in understanding the problem I reported some weeks ago:


https://lists.schedmd.com/pipermail/slurm-users/2023-May/010027.html


I noticed that the intermittent connection timeout I am experiencing occurs only
when the TCP-based direct connection is used to establish communication between
the stepd processes on different nodes.

When the optimized direct connection is disabled with

export SLURM_PMIX_DIRECT_CONN=false

the submission of hetjobs is stable and no connection timeout occurs anymore.
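
A minimal sketch of where this setting might sit in a hetjob batch script. The resource sizes, the pmix_v3 plugin name passed to --mpi, and ./my_mpi_app are placeholders for illustration, not the actual job setup behind this report; only the SLURM_PMIX_DIRECT_CONN line comes from the workaround described above.

```shell
#!/bin/bash
# Hypothetical two-component heterogeneous job (sizes are placeholders).
#SBATCH --job-name=hetjob-pmix-test
#SBATCH --nodes=1 --ntasks=1
#SBATCH hetjob
#SBATCH --nodes=2 --ntasks=4

# Workaround from this thread: disable the direct stepd-to-stepd
# connection of the Slurm PMIx plugin. It must be exported before srun
# so that the launched job steps inherit it.
export SLURM_PMIX_DIRECT_CONN=false

# Launch one MPI application across both het groups.
# Assumption: the PMIx v3 plugin is installed under the name pmix_v3
# (check with `srun --mpi=list`); ./my_mpi_app is a hypothetical binary.
if command -v srun >/dev/null 2>&1; then
    srun --mpi=pmix_v3 --het-group=0,1 ./my_mpi_app
else
    echo "srun not found; skipping launch" >&2
fi
```

Setting the variable in the batch script rather than in the login shell keeps the workaround attached to the job itself, so it applies regardless of the environment the job was submitted from.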

Any idea what can go wrong when using the TCP-based direct connection together with hetjobs?

Cheers,
Denis

---------
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a

Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bertini at gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
Ministerialdirigent Dr. Volkmar Dietz