[slurm-users] Slurm and MPI_Comm_spawn error
Mccall, Kurt E. (MSFC-EV41)
kurt.e.mccall at nasa.gov
Fri Jan 7 20:11:43 UTC 2022
My Slurm/MPICH job is failing when it calls the MPI function MPI_Comm_spawn. That function is specifically mentioned in the Slurm MPI guide (https://slurm.schedmd.com/mpi_guide.html#mpich2), where it states:
> MPICH2 with mpiexec
> Do not add any flags to mpich and build the default (e.g. "./configure -prefix ... ". Do NOT pass the --with-slurm, --with-pmi, --enable-pmiport options).
> Do not add -lpmi to your application (it will force slurm's pmi 1 interface which doesn't support PMI_Spawn_multiple).
> Launch the application using salloc to create the job allocation and mpiexec to launch the tasks. A simple example is shown below.
> salloc -N 2 mpiexec my_application
> All MPI_comm_spawn work fine now going through hydra's PMI 1.1 interface.
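In case it helps to isolate the problem from the rest of the application, a minimal MPI_Comm_spawn test could look like the sketch below. The file names spawn_parent.c and spawn_child.c are just placeholders, and the sketch assumes MPICH built with the default configure as quoted above and launched with mpiexec inside an salloc allocation.

    /* spawn_parent.c - parent side of a minimal MPI_Comm_spawn test */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Comm intercomm;
        int errcodes[2];

        /* Spawn two copies of the child executable; this is the call that
         * goes through the PMI spawn interface. */
        MPI_Comm_spawn("./spawn_child", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, errcodes);

        int remote_size;
        MPI_Comm_remote_size(intercomm, &remote_size);
        printf("parent: spawned %d child ranks\n", remote_size);

        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }

    /* spawn_child.c - child side, just reports that it started */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Comm parent;
        MPI_Comm_get_parent(&parent);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("child %d: started via MPI_Comm_spawn\n", rank);

        if (parent != MPI_COMM_NULL)
            MPI_Comm_disconnect(&parent);
        MPI_Finalize();
        return 0;
    }

Built with mpicc and run with something like "salloc -N 2 mpiexec -n 1 ./spawn_parent", this should print one line from the parent and one from each child if the spawn path through hydra's PMI is working.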
The way this is worded, it sounds like a bug involving MPI_Comm_spawn was fixed at some point. Does anyone remember whether that is the case, and if so, whether the fix happened after 20.02.3? I'm wondering if upgrading to a more recent version of Slurm would solve this problem.
Thanks,
Kurt