[slurm-users] PMix3 Plugin+ openMPI 4.1.5 broken for heterogenous jobs with SLURM v 21.08.8-2
Bertini, Denis Dr.
D.Bertini at gsi.de
Tue Jun 20 06:37:13 UTC 2023
Hi
I made some progress in understanding the problem I reported some weeks ago:
https://lists.schedmd.com/pipermail/slurm-users/2023-May/010027.html
I noticed that the intermittent connection timeout I am experiencing occurs only
when the TCP-based direct connection is used to establish communication between stepd
processes on different nodes.
When disabling the optimized direct connection using
export SLURM_PMIX_DIRECT_CONN=false
the submission of hetjobs is stable and no
connection timeout occurs anymore.
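For reference, a minimal sketch of how this workaround could be placed in a job script. The application name ./my_mpi_app and the het-group selection 0,1 are placeholders rather than my actual setup, and the guard only prints the command when no Slurm allocation is available:

```shell
#!/bin/sh
# Workaround from this post: disable the PMIx plugin's direct
# stepd-to-stepd connection before launching the job step.
export SLURM_PMIX_DIRECT_CONN=false

# Hypothetical launch line: ./my_mpi_app and the het-group list are
# placeholders, adjust them to the actual heterogeneous job.
launch_cmd="srun --mpi=pmix_v3 --het-group=0,1 ./my_mpi_app"

if command -v srun >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
    # Inside an allocation: start the step with the variable exported.
    $launch_cmd
else
    # Outside Slurm: only show what would be run.
    echo "not inside a Slurm allocation; would run: $launch_cmd"
fi
```

The variable has to be exported before srun is called so that it is propagated to the step environment.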
Any idea what can go wrong when using the TCP-based direct connection together with hetjobs?
Cheers,
Denis
---------
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a
Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bertini at gsi.de
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
Ministerialdirigent Dr. Volkmar Dietz