[slurm-users] problems with OpenMPI 4.0.3
Alberto Morillas, Angelines
angelines.alberto at ciemat.es
Fri May 29 10:14:06 UTC 2020
Good morning,
We have a cluster with two kinds of InfiniBand cards: some nodes have ConnectX-4 and the others have ConnectX-6.
OpenMPI 3.1.3 works fine, but when we started using the ConnectX-6 cards we moved to OpenMPI 4.0.3 (which supports ConnectX-6). Since then, programs that run in several parts, first a call to a sequential program and, inside it, a call to a parallel program, … (in our case the program is WRF, but we have others like it with the same problem) suddenly stop:
…..
0 S 4556 87383 87361 0 80 0 - 126676 hrtime ? 00:05:25 real.exe
0 S 4556 87384 87361 0 80 0 - 126677 hrtime ? 00:05:33 real.exe
0 S 4556 87385 87361 0 80 0 - 126675 hrtime ? 00:05:28 real.exe
……
The WCHAN is hrtime, and the processes look as if they are running, but they are not actually doing any work.
We don't know whether this could be a problem with Slurm and this version of OpenMPI. Any ideas?
________________________________________________
Angelines Alberto Morillas
Unidad de Arquitectura Informática
Office: 22.1.32
Tel.: +34 91 346 6119
Fax: +34 91 346 6537
skype: angelines.alberto
CIEMAT
Avenida Complutense, 40
28040 MADRID
________________________________________________