[slurm-users] Multi-node job failure

Chris Woelkers - NOAA Federal chris.woelkers at noaa.gov
Wed Dec 11 15:35:03 UTC 2019


I tried a simple change: swapping out mpirun in the sbatch script for
srun. Nothing more, nothing less.
The model is now working on at least two nodes. I will have to test again
on more nodes, but this is progress.
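
For anyone hitting the same problem, the kind of change involved is sketched
below. This is a minimal illustration, not the actual script: the module name,
node and task counts, time limit, and executable name are placeholders.

    #!/bin/bash
    #SBATCH -N 2                   # run across two nodes
    #SBATCH --ntasks-per-node=16   # MPI ranks per node (placeholder value)
    #SBATCH -t 01:00:00            # wall-time limit

    module load mvapich2           # MPI stack provided by Bright

    # Old launch line, using the MPI library's own launcher:
    #   mpirun -np $SLURM_NTASKS ./model
    # New launch line, letting Slurm start and track the MPI ranks:
    srun ./model

One likely reason this helps: srun bootstraps the ranks through Slurm itself
(via its PMI support), while mpirun typically spawns remote processes through
its own rsh/ssh-based launcher, which can hang silently across nodes if that
path is broken.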

Thanks,

Chris Woelkers
IT Specialist
National Oceanic and Atmospheric Administration
Great Lakes Environmental Research Laboratory
4840 S State Rd | Ann Arbor, MI 48108
734-741-2446


On Wed, Dec 11, 2019 at 10:17 AM Chris Woelkers - NOAA Federal <
chris.woelkers at noaa.gov> wrote:

> Thanks all for the ideas and possibilities. I will answer all in turn.
>
> Paul: Neither of the switches in use, Ethernet and InfiniBand, has any
> form of broadcast storm protection enabled.
>
> Chris: I have passed on your question to the scientist who created
> the sbatch script. I will also look into other scripts that may make use of
> srun to find out whether the same thing occurs.
>
> Jan-Albert: The mvapich2 package is provided by Bright and loaded as a
> module by the script before mpirun is executed.
>
> Zacarias: The drive that the data and script live on is mounted
> on all the nodes at boot.
>
> Thanks,
>
> Chris Woelkers
> IT Specialist
> National Oceanic and Atmospheric Administration
> Great Lakes Environmental Research Laboratory
> 4840 S State Rd | Ann Arbor, MI 48108
> 734-741-2446
>
>
> On Wed, Dec 11, 2019 at 5:15 AM Zacarias Benta <zacarias at lip.pt> wrote:
>
>> I had a similar issue; please check whether the home drive, or wherever the
>> data should be stored, is mounted on the nodes.
>>
>> On Tue, 2019-12-10 at 14:49 -0500, Chris Woelkers - NOAA Federal wrote:
>>
>> I have a 16-node HPC cluster that is in the process of being upgraded from
>> CentOS 6 to 7. All nodes are diskless and connected via 1 Gbps Ethernet and
>> FDR InfiniBand. I am using Bright Cluster Manager to manage it, and their
>> support has not found a solution to this problem.
>> For the most part the cluster is up and running, with all nodes booting
>> and able to communicate with each other over all interfaces at a basic level.
>> Test jobs submitted via sbatch run on one node with no problem but will
>> not run on multiple nodes. The jobs use mpirun, and mvapich2 is installed.
>> Any job that tries to run on multiple nodes ends up timing out (at the limit
>> set via -t), with no output data written and no error messages in the
>> slurm.err or slurm.out files. The job shows up in the squeue output, and the
>> nodes used show up as allocated in the sinfo output.
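
A quick way to separate a Slurm launch problem from an MPI launch problem is a
trivial multi-node test job; a minimal sketch, assuming a two-node allocation:

    #!/bin/bash
    #SBATCH -N 2          # request two nodes
    #SBATCH -t 00:05:00   # short wall-time limit

    # One task per node; if both hostnames print, Slurm can start tasks on
    # both nodes and the hang is more likely in the mpirun/MPI launch path.
    srun --ntasks-per-node=1 hostname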
>>
>> Thanks,
>>
>> Chris Woelkers
>> IT Specialist
>> National Oceanic and Atmospheric Administration
>> Great Lakes Environmental Research Laboratory
>> 4840 S State Rd | Ann Arbor, MI 48108
>> 734-741-2446
>>
>> --
>>
>> Cumprimentos / Best Regards,
>> Zacarias Benta
>> INCD @ LIP - UMinho
>>
>>
>>