[slurm-users] Multi-node job failure

Chris Samuel chris at csamuel.org
Fri Dec 13 06:15:10 UTC 2019


On 11/12/19 8:05 am, Chris Woelkers - NOAA Federal wrote:

> Partial progress. The scientist that developed the model took a look at 
> the output and found that instead of one model run being run in parallel, 
> srun had run multiple instances of the model, one per thread, which for 
> this test was 110 threads.

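This sounds like MVAPICH wasn't built with Slurm support. Without it, each 
process that srun launches can't get its rank and the job size from Slurm, 
so it starts up as its own single-process copy of the model. According to 
the Slurm MPI guide, you need to build it with the following to enable Slurm 
support (and, of course, add any other options you were using):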

./configure --with-pmi=pmi2 --with-pm=slurm
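A rough sketch of the whole rebuild-and-launch sequence, assuming a standard 
autoconf build of MVAPICH and a Slurm installation that provides the pmi2 MPI 
plugin. The install prefix, the make parallelism and the ./model binary name 
are placeholders, not anything from your setup. The key point is that tasks 
must also be launched with srun --mpi=pmi2 (or with MpiDefault=pmi2 set in 
slurm.conf) so that Slurm provides the PMI2 wiring to each task.

```shell
#!/bin/sh
# Sketch only: rebuild MVAPICH with Slurm/PMI2 support.
# PREFIX is a hypothetical placeholder; keep your site's other configure options.
PREFIX=/opt/mvapich-slurm

./configure --prefix="$PREFIX" --with-pmi=pmi2 --with-pm=slurm
make -j 8
make install

# List the MPI plugin types this Slurm installation offers;
# "pmi2" should be among them.
srun --mpi=list

# Make sure the model picks up the new MVAPICH (rebuild or adjust
# PATH/LD_LIBRARY_PATH as needed), then launch it through the PMI2 plugin
# so the tasks form one MPI job rather than many independent copies.
# "./model" is a hypothetical binary name.
srun --mpi=pmi2 -n 110 ./model
```

If the output still shows 110 independent runs, it is worth checking that the 
model is really linked against the rebuilt library and not an older install.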

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


