[slurm-users] MPI job fails with more than 1 node: "Failed to send temp kvs to compute nodes"
leicao88124 at lanl.gov
Mon Jul 15 22:02:07 UTC 2019
I am running Slurm version 19.05.0 and Open MPI version 3.1.4. Open MPI is configured with PMI2 from Slurm. Whenever I try to run an MPI job on more than one node, I get this error message:
srun: error: mpi/pmi2: failed to send temp kvs to compute nodes
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
The job is then killed. If I use only one node, the job runs normally.
In my sbatch script I launch the job with srun --mpi=pmi2 mpi_job.
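For context, a minimal sbatch script matching this setup might look roughly like the following (the job name, node/task counts, time limit, and the mpi_job binary name are placeholders, not the exact values I use):

```shell
#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --nodes=2             # the error only appears with more than 1 node
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00

# Launch through Slurm's PMI2 plugin; Open MPI was built against
# Slurm's PMI2 library so srun handles process wire-up.
srun --mpi=pmi2 ./mpi_job
```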
Has anyone else encountered this problem and managed to fix it? Any help would be appreciated.