[slurm-users] Multi-node job failure

Wed Dec 11 10:04:54 UTC 2019

I had a similar issue. Please check whether the home directory, or
wherever the data should be stored, is mounted on the compute nodes.
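
A minimal sketch of one way to run that check, assuming Linux nodes where
/proc/mounts is readable. The helper name find_mount and the script name
used below are hypothetical, not part of Slurm or Bright. It reports which
mounted filesystem actually backs a given path, so a home directory that
silently falls back to the node's root filesystem stands out.

```python
import os


def find_mount(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point that
    contains `path`, or None if no mount entry matches.

    `mounts_text` is text in /proc/mounts format:
    device mount_point fs_type options dump pass
    Note: mount points containing spaces are escaped there (\\040);
    this sketch does not unescape them.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the most specific (longest) matching mount point.
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best


if __name__ == "__main__":
    home = os.path.expanduser("~")
    try:
        with open("/proc/mounts") as handle:
            mounts = handle.read()
    except OSError as exc:
        print(f"cannot read /proc/mounts: {exc}")
    else:
        # Print the hostname so output from several nodes can be told apart.
        print(os.uname().nodename, home, find_mount(home, mounts))
```

Saved as, say, check_mount.py on storage every node can read, it could be
launched on each allocated node with something like
"srun -N 2 python3 check_mount.py". If a node reports "/" with a RAM-backed
or root filesystem type for the home directory instead of the shared
filesystem, the share is not mounted there.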
On Tue, 2019-12-10 at 14:49 -0500, Chris Woelkers - NOAA Federal wrote:
> I have a 16 node HPC that is in the process of being upgraded from
> CentOS 6 to 7. All nodes are diskless and connected via 1Gbps
> Ethernet and FDR Infiniband. I am using Bright Cluster Management to
> manage it and their support has not found a solution to this
> problem. For the most part, the cluster is up and running with all
> nodes booting and able to communicate with each other via all
> interfaces on a basic level.
> Test jobs, submitted via sbatch, are able to run on one node with no
> problem but will not run on multiple nodes. The jobs are using mpirun
> and mvapich2 is installed.
> Any job trying to run on multiple nodes ends up timing out, as set
> via -t, with no output data written and no error messages in the
> slurm.err or slurm.out files. The job shows up in the squeue output
> and the nodes used show up as allocated in the sinfo output.
> 
> Thanks,
> 
> Chris Woelkers
> IT Specialist
> National Oceanic and Atmospheric Agency
> Great Lakes Environmental Research Laboratory
> 4840 S State Rd | Ann Arbor, MI 48108
> 734-741-2446
-- 
Cumprimentos / Best Regards,
Zacarias Benta
INCD @ LIP - UMinho
  

     	