<div dir="ltr"><div>You seem to be using a very old Open MPI version (the current one is 3.0), so I'd suggest trying that if you can.</div><div>It also seems like a pure Open MPI problem, so the Open MPI dev list may be more appropriate for this topic.</div><div><br></div><div class="gmail_extra"><br><div class="gmail_quote">2017-12-07 12:53 GMT-08:00 Glenn (Gedaliah) Wolosh <span dir="ltr"><<a href="mailto:gwolosh@njit.edu" target="_blank">gwolosh@njit.edu</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><br>
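[Editor's note] The upgrade suggested above can be sketched as a plain source build. This is a hedged sketch only: the install prefix, the `make -j` width, and the PMI library location under `/usr` are assumptions to adjust per site (an easybuild site like the poster's would instead bump the OpenMPI easyconfig version and keep the `--with-slurm --with-pmi` configopts shown later in the thread).

```shell
# Sketch: build a current Open MPI (3.0.x at the time of this thread)
# with Slurm PMI support. Paths and parallelism are placeholders.
wget https://download.open-mpi.org/release/open-mpi/v3.0/openmpi-3.0.0.tar.bz2
tar xjf openmpi-3.0.0.tar.bz2
cd openmpi-3.0.0
./configure --prefix=$HOME/opt/ompi-3.0.0 \
            --with-slurm --with-pmi=/usr \
            --with-verbs
make -j 8 install

# Put the new build first, rebuild the benchmark, then retry the failing run:
export PATH=$HOME/opt/ompi-3.0.0/bin:$PATH
export LD_LIBRARY_PATH=$HOME/opt/ompi-3.0.0/lib:$LD_LIBRARY_PATH
srun --mpi=pmi2 --nodes=8 --ntasks-per-node=8 --ntasks=64 ./ep.C.64
```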
<br><div><span class=""><blockquote type="cite"><div>On Dec 7, 2017, at 3:26 PM, Artem Polyakov <<a href="mailto:artpol84@gmail.com" target="_blank">artpol84@gmail.com</a>> wrote:</div><br class="m_-5656203663092894886Apple-interchange-newline"><div><div><div dir="auto">Given that ring is working, I don't think it's a PMI problem.</div><div dir="auto"><br></div><div dir="auto">Can you try running NPB with the tcp btl parameters that I've provided? (I assume you have a TCP interconnect; let me know if that's not the case.)</div></div></div></blockquote><blockquote type="cite"><div><div><br><div class="gmail_quote"><div>Thu, Dec 7, 2017 at 12:03, Glenn (Gedaliah) Wolosh <<a href="mailto:gwolosh@njit.edu" target="_blank">gwolosh@njit.edu</a>>:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div><blockquote type="cite"><div>On Dec 7, 2017, at 1:18 PM, Artem Polyakov <<a href="mailto:artpol84@gmail.com" target="_blank">artpol84@gmail.com</a>> wrote:</div><br class="m_-5656203663092894886m_-4122188114053557769Apple-interchange-newline"><div><div>A couple of things to try to locate the issue:<div><br></div><div>1. To check whether PMI is working: have you tried running something simple, like hello_c (<a href="https://github.com/open-mpi/ompi/blob/master/examples/hello_c.c" target="_blank">https://github.com/open-mpi/<wbr>ompi/blob/master/examples/<wbr>hello_c.c</a>) and ring_c (<a href="https://github.com/open-mpi/ompi/blob/master/examples/ring_c.c" target="_blank">https://github.com/open-mpi/<wbr>ompi/blob/master/examples/<wbr>ring_c.c</a>)? Please run those two and post the results.</div><div>2. 
If hello is working and ring is not, can you try changing the fabric to TCP: </div><div>$ export OMPI_MCA_btl=tcp,self</div><div>$ export OMPI_MCA_pml=ob1</div><div>$ srun ...</div><div><br></div><div>Please provide the outputs.</div></div></div></blockquote></div></div></blockquote></div></div></div></blockquote><div><br></div><div><br></div></span><div>export OMPI_MCA_btl=tcp,self</div><div>export OMPI_MCA_pml=ob1</div><div><br></div><div>srun --nodes=8 --ntasks-per-node=8 --ntasks=64 --mpi=pmi2 ./ep.C.64</div><div><br></div><div>This works —</div><div><br></div><div><div>NAS Parallel Benchmarks 3.3 -- EP Benchmark</div><div><br></div><div> Number of random numbers generated: 8589934592</div><div> Number of active processes: 64</div><div><br></div><div>EP Benchmark Results:</div><div><br></div><div>CPU Time = 5.9208</div><div>N = 2^ 32</div><div>No. Gaussian Pairs = 3373275903.</div><div>Sums = 4.764367927992081D+04 -8.084072988045549D+04</div><div>Counts:</div><div> 0 1572172634.</div><div> 1 1501108549.</div><div> 2 281805648.</div><div> 3 17761221.</div><div> 4 424017.</div><div> 5 3821.</div><div> 6 13.</div><div> 7 0.</div><div> 8 0.</div><div> 9 0.</div><div><br></div><div><br></div><div> EP Benchmark Completed.</div><div> Class = C</div><div> Size = 8589934592</div><div> Iterations = 0</div><div> Time in seconds = 5.92</div><div> Total processes = 64</div><div> Compiled procs = 64</div><div> Mop/s total = 1450.82</div><div> Mop/s/process = 22.67</div><div> Operation type = Random numbers generated</div><div> Verification = SUCCESSFUL</div><div> Version = 3.3.1</div><div> Compile date = 07 Dec 2017</div><div><br></div><div> Compile options:</div><div> MPIF77 = mpif77</div><div> FLINK = $(MPIF77)</div><div><div> FMPI_LIB = 
-L/opt/local/easybuild/<wbr>software/Compiler/GC...</div><div> FMPI_INC = -I/opt/local/easybuild/<wbr>software/Compiler/GC...</div><div> FFLAGS = -O</div><div> FLINKFLAGS = -O</div><div> RAND = randi8</div><div><br></div><div><br></div><div> Please send feedbacks and/or the results of this run to:</div><div><br></div><div> NPB Development Team</div><div> Internet: <a href="mailto:npb@nas.nasa.gov" target="_blank">npb@nas.nasa.gov</a></div><div><br></div></div><div>Hmm...</div></div><div><div class="h5"><br><blockquote type="cite"><div><div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div><div>srun --mpi=pmi2 --ntasks-per-node=8 --ntasks=16 ./hello_c > hello_c.out</div><div><br></div><div><div>Hello, world, I am 24 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 0 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 25 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 1 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 27 of 32, (Open MPI v1.10.3, package: Open MPI <a 
href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 2 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 29 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 31 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 30 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 4 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 5 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 17 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" 
target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 3 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 7 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 6 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 18 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 22 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 23 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 19 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" 
target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 9 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 20 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 8 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 10 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 13 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 11 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 26 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" 
target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 16 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 14 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 28 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 21 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 15 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div>Hello, world, I am 12 of 32, (Open MPI v1.10.3, package: Open MPI <a href="mailto:gwolosh@snode2.p-stheno.tartan.njit.edu" target="_blank">gwolosh@snode2.p-stheno.<wbr>tartan.njit.edu</a> Distribution, ident: 1.10.3, repo rev: v1.10.2-251-g9acf492, Jun 14, 2016, 150)</div><div><br></div><div> srun --mpi=pmi2 --ntasks-per-node=8 --ntasks=16 --nodes=2 ./ring_c > ring_c.out</div><div><br></div><div><div>Process 1 exiting</div><div>Process 12 
exiting</div><div>Process 14 exiting</div><div>Process 13 exiting</div><div>Process 3 exiting</div><div>Process 11 exiting</div><div>Process 5 exiting</div><div>Process 6 exiting</div><div>Process 2 exiting</div><div>Process 4 exiting</div><div>Process 9 exiting</div><div>Process 10 exiting</div><div>Process 7 exiting</div><div>Process 15 exiting</div><div>Process 0 sending 10 to 1, tag 201 (16 processes in ring)</div><div>Process 0 sent to 1</div><div>Process 0 decremented value: 9</div><div>Process 0 decremented value: 8</div><div>Process 0 decremented value: 7</div><div>Process 0 decremented value: 6</div><div>Process 0 decremented value: 5</div><div>Process 0 decremented value: 4</div><div>Process 0 decremented value: 3</div><div>Process 0 decremented value: 2</div><div>Process 0 decremented value: 1</div><div>Process 0 decremented value: 0</div><div>Process 0 exiting</div><div>Process 8 exiting</div></div></div></div></div><div style="word-wrap:break-word"><div><br><blockquote type="cite"><div><div class="gmail_extra"><br><div class="gmail_quote">2017-12-07 10:05 GMT-08:00 Glenn (Gedaliah) Wolosh <span><<a href="mailto:gwolosh@njit.edu" target="_blank">gwolosh@njit.edu</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><br>
<br><div><span><blockquote type="cite"><div>On Dec 7, 2017, at 12:51 PM, Artem Polyakov <<a href="mailto:artpol84@gmail.com" target="_blank">artpol84@gmail.com</a>> wrote:</div><br class="m_-5656203663092894886m_-4122188114053557769m_6041386738242103588Apple-interchange-newline"><div><div>also please post the output of<div>$ srun --mpi=list</div></div></div></blockquote><div><br></div></span><div><div>[gwolosh@p-slogin bin]$ srun --mpi=list</div><div>srun: MPI types are...</div><div>srun: mpi/mpich1_shmem</div><div>srun: mpi/mpich1_p4</div><div>srun: mpi/lam</div><div>srun: mpi/openmpi</div><div>srun: mpi/none</div><div>srun: mpi/mvapich</div><div>srun: mpi/mpichmx</div><div>srun: mpi/pmi2</div><div>srun: mpi/mpichgm</div></div><span><div><br></div><br><blockquote type="cite"><div><div><div><br></div><div>When job crashes - is there any error messages in the relevant slurmd.log's or output on the screen?</div></div></div></blockquote><div><br></div></span><div>on screen —</div><div><br></div><div><div>[snode4][[274,1],24][connect/<wbr>btl_openib_connect_udcm.c:<wbr>1448:udcm_wait_for_send_<wbr>completion] send failed with verbs status 2</div><div>[snode4:5175] *** An error occurred in MPI_Bcast</div><div>[snode4:5175] *** reported by process [17956865,24]</div><div>[snode4:5175] *** on communicator MPI_COMM_WORLD</div><div>[snode4:5175] *** MPI_ERR_OTHER: known error not in list</div><div>[snode4:5175] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,</div><div>[snode4:5175] *** and potentially your MPI job)</div><div>mlx4: local QP operation err (QPN 0005f3, WQE index 40000, vendor syndrome 6c, opcode = 5e)</div><div>srun: Job step aborted: Waiting up to 32 seconds for job step to finish.</div><div>[snode4][[274,1],31][connect/<wbr>btl_openib_connect_udcm.c:<wbr>1448:udcm_wait_for_send_<wbr>completion] send failed with verbs status 2</div><div>slurmstepd: error: *** STEP 274.0 ON snode1 CANCELLED AT 2017-12-07T12:55:46 
***</div><div>[snode4:5182] *** An error occurred in MPI_Bcast</div><div>[snode4:5182] *** reported by process [17956865,31]</div><div>[snode4:5182] *** on communicator MPI_COMM_WORLD</div><div>[snode4:5182] *** MPI_ERR_OTHER: known error not in list</div><div>[snode4:5182] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,</div><div>[snode4:5182] *** and potentially your MPI job)</div><div>mlx4: local QP operation err (QPN 0005f7, WQE index 40000, vendor syndrome 6c, opcode = 5e)</div><div>[snode4][[274,1],27][connect/<wbr>btl_openib_connect_udcm.c:<wbr>1448:udcm_wait_for_send_<wbr>completion] send failed with verbs status 2</div><div>[snode4:5178] *** An error occurred in MPI_Bcast</div><div>[snode4:5178] *** reported by process [17956865,27]</div><div>[snode4:5178] *** on communicator MPI_COMM_WORLD</div><div>[snode4:5178] *** MPI_ERR_OTHER: known error not in list</div><div>[snode4:5178] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,</div><div>[snode4:5178] *** and potentially your MPI job)</div><div>mlx4: local QP operation err (QPN 0005fa, WQE index 40000, vendor syndrome 6c, opcode = 5e)</div><div>srun: error: snode4: tasks 24,31: Exited with exit code 16</div><div>srun: error: snode4: tasks 25-30: Killed</div><div>srun: error: snode5: tasks 32-39: Killed</div><div>srun: error: snode3: tasks 16-23: Killed</div><div>srun: error: snode8: tasks 56-63: Killed</div><div>srun: error: snode7: tasks 48-55: Killed</div><div>srun: error: snode1: tasks 0-7: Killed</div><div>srun: error: snode2: tasks 8-15: Killed</div><div>srun: error: snode6: tasks 40-47: Killed</div><div><br></div><div>Nothing striking in the slurmd log</div><div><br></div></div><div><div class="m_-5656203663092894886m_-4122188114053557769h5"><br><blockquote type="cite"><div><div class="gmail_extra"><br><div class="gmail_quote">2017-12-07 9:49 GMT-08:00 Artem Polyakov <span><<a href="mailto:artpol84@gmail.com" 
target="_blank">artpol84@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>Hello,<div><br></div><div>What is the value of the MpiDefault option in your Slurm configuration file?</div></div><div class="gmail_extra"><div><div class="m_-5656203663092894886m_-4122188114053557769m_6041386738242103588h5"><br><div class="gmail_quote">2017-12-07 9:37 GMT-08:00 Glenn (Gedaliah) Wolosh <span><<a href="mailto:gwolosh@njit.edu" target="_blank">gwolosh@njit.edu</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div>Hello</div><div><br></div><div>This is Slurm version 17.02.6 running on Scientific Linux release 7.4 (Nitrogen).</div><div><br></div><div><div>[gwolosh@p-slogin bin]$ module li</div><div><br></div><div>Currently Loaded Modules:</div><div> 1) GCCcore/.5.4.0 (H) 2) binutils/.2.26 (H) 3) GCC/5.4.0-2.26 4) numactl/2.0.11 5) hwloc/1.11.3 6) OpenMPI/1.10.3</div></div><div><br></div><div>If I run</div><div><br></div><div>srun --nodes=8 --ntasks-per-node=8 --ntasks=64 ./ep.C.64</div><div><br></div><div>It runs successfully but I get a message —</div><div><br></div><div><div>PMI2 initialized but returned bad values for size/rank/jobid.</div><div>This is symptomatic of either a failure to use the</div><div>"--mpi=pmi2" flag in SLURM, or a borked PMI2 installation.</div><div>If running under SLURM, try adding "-mpi=pmi2" to your</div><div>srun command line. 
If that doesn't work, or if you are</div><div>not running under SLURM, try removing or renaming the</div><div>pmi2.h header file so PMI2 support will not automatically</div><div>be built, reconfigure and build OMPI, and then try again</div><div>with only PMI1 support enabled.</div><div><br></div><div>If I run</div><div><br></div><div>srun --nodes=8 --ntasks-per-node=8 --ntasks=64 --mpi=pmi2 ./ep.C.64</div><div><br></div><div>the job crashes.</div><div><br></div><div>If I run via sbatch —</div><div><br></div><div><div>#!/bin/bash</div><div># Job name:</div><div>#SBATCH --job-name=nas_bench</div><div>#SBATCH --nodes=8</div><div>#SBATCH --ntasks=64</div><div>#SBATCH --ntasks-per-node=8</div><div>#SBATCH --time=48:00:00</div><div>#SBATCH --output=nas.out.1</div><div>#</div><div>## Command(s) to run (example):</div><div>module use $HOME/easybuild/modules/all/<wbr>Core</div><div>module load GCC/5.4.0-2.26 OpenMPI/1.10.3</div><div>mpirun -np 64 ./ep.C.64</div></div><div><br></div><div>the job crashes.</div><div><br></div><div>Using easybuild, these are my config options for ompi —</div><div><br></div><div><div>configopts = '--with-threads=posix --enable-shared --enable-mpi-thread-multiple --with-verbs '</div><div>configopts += '--enable-mpirun-prefix-by-<wbr>default ' # suppress failure modes in relation to mpirun path</div><div>configopts += '--with-hwloc=$EBROOTHWLOC ' # hwloc support</div><div>configopts += '--disable-dlopen ' # statically link component, don't do dynamic loading</div><div>configopts += '--with-slurm --with-pmi '</div></div><div><br></div><div>And finally —</div><div><br></div><div><div>$ ldd /opt/local/easybuild/software/<wbr>Compiler/GCC/5.4.0-2.26/<wbr>OpenMPI/1.10.3/bin/orterun | grep pmi</div><div> libpmi.so.0 => /usr/lib64/libpmi.so.0 (0x00007f0129d6d000)</div><div> libpmi2.so.0 => /usr/lib64/libpmi2.so.0 (0x00007f0129b51000)</div></div><div><br></div><div><div>$ ompi_info | grep pmi</div><div> MCA db: pmi (MCA v2.0.0, API v1.0.0, Component 
v1.10.3)</div><div> MCA ess: pmi (MCA v2.0.0, API v3.0.0, Component v1.10.3)</div><div> MCA grpcomm: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)</div><div> MCA pubsub: pmi (MCA v2.0.0, API v2.0.0, Component v1.10.3)</div></div><div><br></div><div><br></div><div>Any suggestions?</div></div><div>
<div style="letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;word-wrap:break-word"><div style="letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;word-wrap:break-word">_______________<br>Gedaliah Wolosh<br>IST Academic and Research Computing Systems (ARCS)<br>NJIT<br>GITC 2203<br><a href="tel:(973)%20596-5437" value="+19735965437" target="_blank">973 596 5437</a><br><a href="mailto:gwolosh@njit.edu" target="_blank">gwolosh@njit.edu</a><br></div></div>
</div>
<br></div></blockquote></div><br><br clear="all"><div><br></div></div></div><span class="m_-5656203663092894886m_-4122188114053557769m_6041386738242103588HOEnZb"><font color="#888888">-- <br><div class="m_-5656203663092894886m_-4122188114053557769m_6041386738242103588m_5736140807596716564gmail_signature" data-smartmail="gmail_signature">С Уважением, Поляков Артем Юрьевич<br>Best regards, Artem Y. Polyakov</div>
</font></span></div>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="m_-5656203663092894886m_-4122188114053557769m_6041386738242103588gmail_signature" data-smartmail="gmail_signature">С Уважением, Поляков Артем Юрьевич<br>Best regards, Artem Y. Polyakov</div>
</div>
</div></blockquote></div></div></div><br></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="m_-5656203663092894886m_-4122188114053557769gmail_signature" data-smartmail="gmail_signature">С Уважением, Поляков Артем Юрьевич<br>Best regards, Artem Y. Polyakov</div>
</div>
</div></blockquote></div></div></blockquote></div></div><div dir="ltr">-- <br></div><div class="m_-5656203663092894886gmail_signature" data-smartmail="gmail_signature">-----
Best regards, Artem Polyakov
(Mobile mail)</div>
</div></blockquote></div></div></div><br></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature">С Уважением, Поляков Артем Юрьевич<br>Best regards, Artem Y. Polyakov</div>
</div></div>