[slurm-users] Multi-node job failure

Chris Samuel chris at csamuel.org
Wed Dec 11 05:07:58 UTC 2019


Hi Chris,

On Tuesday, 10 December 2019 11:49:44 AM PST Chris Woelkers - NOAA Federal 
wrote:

> Test jobs, submitted via sbatch, are able to run on one node with no problem
> but will not run on multiple nodes. The jobs are using mpirun and mvapich2
> is installed.

Is there a reason why you aren't using srun for launching these?

https://slurm.schedmd.com/mpi_guide.html

If you're using mpirun then, unless you've built mvapich2 with Slurm support,
you'll be relying on ssh to launch tasks, and that could be what's broken for
you.  Running with srun will avoid that and allow Slurm to track your
processes correctly.
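
As a rough sketch of what that could look like in a batch script: the job
name, node and task counts, and the program name ./mpi_hello below are
placeholders, and the --mpi=pmi2 setting assumes your mvapich2 build has PMI2
support, which may not match your install ("srun --mpi=list" shows what your
Slurm offers).

```shell
#!/bin/bash
#SBATCH --job-name=mpi-test      # hypothetical job name
#SBATCH --nodes=2                # two nodes, to exercise the multi-node path
#SBATCH --ntasks-per-node=4      # illustrative task count, adjust to your hardware
#SBATCH --time=00:05:00

# Hypothetical program name; replace with your own MPI binary.
MPI_PROGRAM="${MPI_PROGRAM:-./mpi_hello}"

# Assumption: mvapich2 was built with PMI2 support so that srun can
# launch the ranks itself. Override MPI_TYPE if your build differs.
MPI_TYPE="${MPI_TYPE:-pmi2}"

if command -v srun >/dev/null 2>&1; then
    # Sanity check: each task prints the node it runs on, so every
    # allocated node should show up in the output.
    srun hostname
    # Launch the MPI program through Slurm instead of mpirun over ssh.
    srun --mpi="$MPI_TYPE" "$MPI_PROGRAM"
else
    # Not on a Slurm system: nothing is launched.
    echo "srun not found; submit this script with sbatch on the cluster" >&2
fi
```

If the "srun hostname" step already fails to reach the second node, the
problem is in the Slurm setup rather than in MPI.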

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA





