[slurm-users] Multi-node job failure

Ree, Jan-Albert van J.A.v.Ree at marin.nl
Wed Dec 11 07:54:07 UTC 2019


OK so OpenMPI works fine. That means SLURM, OFED and hardware are fine.

Which mvapich2 package are you using: a home-built one or the one provided by Bright?


Regards,

--

Jan-Albert


Jan-Albert van Ree | Linux System Administrator | Digital Services
MARIN | T +31 317 49 35 48 | J.A.v.Ree at marin.nl | www.marin.nl


________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Chris Woelkers - NOAA Federal <chris.woelkers at noaa.gov>
Sent: Wednesday, December 11, 2019 01:11
To: Slurm User Community List
Subject: Re: [slurm-users] Multi-node job failure

Thanks for the reply and the things to try. Here are the answers to your questions/tests in order:

- I tried mpiexec and the same issue occurred.
- While the job was listed as running I checked all the nodes. None of them had any processes spawned. I have no idea about the hydra process.
- I have version 4.7 of the OFED stack installed on all nodes.
- Using openmpi with the hello world example you listed gives output that seems to match what should normally be given. I upped the number of threads to 16, because 4 doesn't help much, and ran it again with four nodes of 4 threads each, and got the following, which looks like good output.
Hello world from processor bearnode14, rank 4 out of 16 processors
Hello world from processor bearnode14, rank 5 out of 16 processors
Hello world from processor bearnode14, rank 6 out of 16 processors
Hello world from processor bearnode15, rank 10 out of 16 processors
Hello world from processor bearnode15, rank 8 out of 16 processors
Hello world from processor bearnode16, rank 13 out of 16 processors
Hello world from processor bearnode15, rank 11 out of 16 processors
Hello world from processor bearnode13, rank 3 out of 16 processors
Hello world from processor bearnode14, rank 7 out of 16 processors
Hello world from processor bearnode15, rank 9 out of 16 processors
Hello world from processor bearnode16, rank 12 out of 16 processors
Hello world from processor bearnode16, rank 14 out of 16 processors
Hello world from processor bearnode16, rank 15 out of 16 processors
Hello world from processor bearnode13, rank 1 out of 16 processors
Hello world from processor bearnode13, rank 0 out of 16 processors
Hello world from processor bearnode13, rank 2 out of 16 processors
- I have not tested our test model with openmpi as it was compiled with Intel compilers and expects Intel MPI. It might work but for now I will hold that for later. I did test the hello world again using the Intel modules instead of the openmpi modules and it still worked.
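
For what it's worth, the hello world output above can be checked mechanically rather than by eye. The following is only a minimal sketch in Python, assuming the output lines keep exactly the "Hello world from processor <node>, rank <r> out of <n> processors" format shown above; the function name summarize is made up for illustration. Applied to the 16 lines above it should report four ranks on each of bearnode13 through bearnode16 and no missing ranks.

```python
import re
from collections import defaultdict

# Matches the hello world lines quoted above; the format is an assumption
# taken from that output, not from any MPI specification.
LINE_RE = re.compile(
    r"Hello world from processor (\S+), rank (\d+) out of (\d+) processors"
)

def summarize(output):
    """Return (ranks_by_node, missing_ranks) for MPI hello world output."""
    ranks_by_node = defaultdict(list)
    world_size = 0
    for line in output.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # ignore anything that is not a hello world line
        node = match.group(1)
        rank = int(match.group(2))
        world_size = int(match.group(3))
        ranks_by_node[node].append(rank)
    seen = {rank for ranks in ranks_by_node.values() for rank in ranks}
    missing = sorted(set(range(world_size)) - seen)
    return {node: sorted(ranks) for node, ranks in ranks_by_node.items()}, missing

if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python check_hello.py < slurm-<jobid>.out
    by_node, missing = summarize(sys.stdin.read())
    for node in sorted(by_node):
        print(node, by_node[node])
    print("missing ranks:", missing)
```

An empty "missing ranks" list together with an even spread of ranks over the nodes indicates that every rank started on the node where Slurm placed it.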

Thanks,

Chris Woelkers



