[slurm-users] problems with OpenMPI 4.0.3

Alberto Morillas, Angelines angelines.alberto at ciemat.es
Fri May 29 10:14:06 UTC 2020


Good morning,

We have a cluster with two kinds of InfiniBand cards: ConnectX-4 and ConnectX-6.
OpenMPI 3.1.3 works fine, but when we started using the ConnectX-6 cards we moved to OpenMPI 4.0.3 (which supports ConnectX-6). Since then, programs that run in several parts (first a call to a sequential program and, inside it, a call to a parallel program, …) suddenly stop. In our case the program is WRF, but we have others like it with the same problem:

…..
0 S  4556  87383  87361  0  80   0 - 126676 hrtime ?       00:05:25 real.exe
0 S  4556  87384  87361  0  80   0 - 126677 hrtime ?       00:05:33 real.exe
0 S  4556  87385  87361  0  80   0 - 126675 hrtime ?       00:05:28 real.exe
……
WCHAN is hrtime, so the processes look as if they are running, but they are not actually doing any work.
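
As a rough aid for checking many nodes, here is a minimal sketch that tallies the state and WCHAN columns of a listing like the one above. It assumes the listing comes from `ps -l` style output (columns F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD), which matches the layout shown. The function name `summarize_wchan` is only illustrative, not part of any tool.

```python
from collections import Counter

# Sample lines copied from the listing above. Assumed column layout (ps -l):
# F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
SAMPLE = """\
0 S  4556  87383  87361  0  80   0 - 126676 hrtime ?       00:05:25 real.exe
0 S  4556  87384  87361  0  80   0 - 126677 hrtime ?       00:05:33 real.exe
0 S  4556  87385  87361  0  80   0 - 126675 hrtime ?       00:05:28 real.exe
"""


def summarize_wchan(listing):
    """Count (state, WCHAN, command) triples in ps -l style output."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, elision markers and the header row:
        # a process row has at least 14 fields and a numeric PID.
        if len(fields) < 14 or not fields[3].isdigit():
            continue
        state, wchan, cmd = fields[1], fields[10], fields[13]
        counts[(state, wchan, cmd)] += 1
    return counts


summary = summarize_wchan(SAMPLE)
for (state, wchan, cmd), n in summary.items():
    print(f"{n} x {cmd}: state={state} wchan={wchan}")
```

A state of S means interruptible sleep, so if every rank reports the same sleeping state and WCHAN while its CPU time stops growing, that is consistent with the hang described here rather than with normal computation.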

We don't know whether this could be a problem between Slurm and this version of OpenMPI. Any ideas?


________________________________________________

Angelines Alberto Morillas

Unidad de Arquitectura Informática
Despacho: 22.1.32
Telf.: +34 91 346 6119
Fax:   +34 91 346 6537

skype: angelines.alberto

CIEMAT
Avenida Complutense, 40
28040 MADRID
________________________________________________

