[slurm-users] Multi-node job failure
    Chris Samuel 
    chris at csamuel.org
       
    Wed Dec 11 05:07:58 UTC 2019
    
    
  
Hi Chris,
On Tuesday, 10 December 2019 11:49:44 AM PST Chris Woelkers - NOAA Federal 
wrote:
> Test jobs, submitted via sbatch, are able to run on one node with no problem
> but will not run on multiple nodes. The jobs are using mpirun and mvapich2
> is installed.
Is there a reason why you aren't using srun for launching these?
https://slurm.schedmd.com/mpi_guide.html
If you're using mpirun then (unless you've built mvapich2 with Slurm support) 
you'll be relying on ssh to launch tasks, so that could be what's 
broken for you.  Running with srun will avoid that and allow Slurm to track 
your processes correctly.
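As a rough illustration of launching with srun, a minimal batch script might look something like the sketch below. The job name, node and task counts, and the ./mpi_hello binary are all placeholders, and --mpi=pmi2 is an assumption that only works if your MPI library was built with PMI2 support ("srun --mpi=list" shows what your Slurm install offers).

```shell
#!/bin/bash
#SBATCH --job-name=mpi-test      # placeholder job name
#SBATCH --nodes=2                # example: request two nodes
#SBATCH --ntasks-per-node=4      # example: four MPI ranks per node

# Placeholder MPI binary; override with APP=/path/to/your/program.
app="${APP:-./mpi_hello}"

# Assumption: the MPI library supports PMI2. Adjust to match your build.
mpi_type="pmi2"

if command -v srun >/dev/null 2>&1; then
    # srun starts the tasks through Slurm itself, so no ssh between nodes
    # is needed and Slurm can track every process in the job.
    srun --mpi="$mpi_type" "$app"
else
    echo "srun not found; submit this script with sbatch on a Slurm cluster" >&2
fi
```

Submit it with sbatch as usual; srun picks up the node and task counts from the allocation, so they don't need repeating on the srun line.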
All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
    
    