[slurm-users] Multi-node job failure

Tue Dec 10 19:49:44 UTC 2019

I have a 16-node HPC cluster that is in the process of being upgraded from
CentOS 6 to CentOS 7. All nodes are diskless and connected via 1 Gbps
Ethernet and FDR InfiniBand. I am using Bright Cluster Manager to manage it,
and their support has not found a solution to this problem.

For the most part the cluster is up and running: all nodes boot and can
communicate with each other over all interfaces on a basic level. Test jobs
submitted via sbatch run on one node with no problem but will not run on
multiple nodes. The jobs use mpirun, and mvapich2 is installed.

Any job that tries to run on multiple nodes ends up timing out at the limit
set via -t, with no output data written and no error messages in the
slurm.err or slurm.out files. The job shows up in the squeue output, and the
nodes used show up as allocated in the sinfo output.
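
For reference, here is a rough sketch of the kind of test job described
above. It is not the actual script. The Python helper only generates the text
of an sbatch script, and the program name ./mpi_hello, the node count, and
the time limit are hypothetical placeholders. The added "srun hostname" step
is an assumption about how one might check whether Slurm itself can start
tasks on every node before mpirun is involved.

```python
def build_test_job(nodes: int, minutes: int, program: str = "./mpi_hello") -> str:
    """Return the text of a small multi-node sbatch test script.

    `program` is a hypothetical MPI binary; replace it with a real one.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH --time={minutes}",       # time limit in minutes (same as -t)
        "#SBATCH --output=slurm.out",
        "#SBATCH --error=slurm.err",
        "",
        "# Step 1: no MPI involved; checks that Slurm can start a task on every node.",
        "srun hostname",
        "",
        "# Step 2: the MPI launch that currently hangs until the time limit.",
        f"mpirun -np {nodes} {program}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a two-node, five-minute test job; redirect to a file and submit it with sbatch.
    print(build_test_job(nodes=2, minutes=5), end="")
```

If the hostname step prints one line per node but the mpirun step still
hangs, that would point at the MPI launch rather than at Slurm's own task
startup.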

Thanks,

Chris Woelkers
IT Specialist
National Oceanic and Atmospheric Agency
Great Lakes Environmental Research Laboratory
4840 S State Rd | Ann Arbor, MI 48108
734-741-2446