[slurm-users] Multi-node job failure
Chris Samuel
chris at csamuel.org
Wed Dec 11 05:07:58 UTC 2019
Hi Chris,
On Tuesday, 10 December 2019 11:49:44 AM PST Chris Woelkers - NOAA Federal
wrote:
> Test jobs, submitted via sbatch, are able to run on one node with no problem
> but will not run on multiple nodes. The jobs are using mpirun and mvapich2
> is installed.
Is there a reason why you aren't using srun for launching these?
https://slurm.schedmd.com/mpi_guide.html
If you're using mpirun then (unless you've built mvapich2 with Slurm support)
you'll be relying on ssh to launch tasks, and that could be what's
broken for you. Running with srun will avoid that and allow Slurm to track
your processes correctly.
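
As a rough sketch, a batch script that launches through srun might look
something like the one below. The job name, node and task counts, time limit
and the binary name ./mpi_hello are placeholders made up for illustration, and
--mpi=pmi2 assumes your mvapich2 was built against Slurm's PMI2 library as
described in the guide above, so adjust it to whatever your build supports.

```shell
#!/bin/bash
#SBATCH --job-name=mpi-test      # placeholder job name
#SBATCH --nodes=2                # placeholder: request two nodes
#SBATCH --ntasks-per-node=4      # placeholder: MPI ranks per node
#SBATCH --time=00:05:00          # placeholder time limit

# Hypothetical MPI binary; replace with your own program.
APP=./mpi_hello

# Launch the ranks through Slurm instead of mpirun, so no ssh between
# nodes is needed and Slurm can track every process in the job.
# Assumption: mvapich2 was built with Slurm PMI2 support.
srun --mpi=pmi2 "$APP"
```

Running "srun --mpi=list" on the cluster should show which MPI plugin types
your Slurm installation actually provides.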
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA