<div dir="ltr" data-setdir="false">Works here on slurm 18.08.8, pmix 3.1.2. The mpi world ranks are unified as they should be.</div><div dir="ltr" data-setdir="false"><br></div><div dir="ltr" data-setdir="false"><div><div>$ srun --mpi=pmix -n2 -wathos ./hello : -n8 -wporthos ./hello</div><div>srun: job 586 queued and waiting for resources</div><div>srun: job 586 has been allocated resources</div><div>Hello world from processor athos, rank 1 out of 10 processors</div><div>Hello world from processor athos, rank 0 out of 10 processors</div><div>Hello world from processor porthos, rank 2 out of 10 processors</div><div>Hello world from processor porthos, rank 7 out of 10 processors</div><div>Hello world from processor porthos, rank 3 out of 10 processors</div><div>Hello world from processor porthos, rank 4 out of 10 processors</div><div>Hello world from processor porthos, rank 6 out of 10 processors</div><div>Hello world from processor porthos, rank 8 out of 10 processors</div><div>Hello world from processor porthos, rank 9 out of 10 processors</div><div>Hello world from processor porthos, rank 5 out of 10 processors</div><div><br></div></div><br></div>
</div><div id="yahoo_quoted_3442851970" class="yahoo_quoted">
<div style="font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;font-size:13px;color:#26282a;">
<div>
On Tuesday, July 16, 2019, 09:49:59 AM EDT, Mehlberg, Steve <steve.mehlberg@atos.net> wrote:
</div>
<div><br></div>
<div><br></div>
<div><div id="yiv5886144848">
<style><!--
#yiv5886144848
_filtered #yiv5886144848 {font-family:"Cambria Math";panose-1:2 4 5 3 5 4 6 3 2 4;}
_filtered #yiv5886144848 {font-family:Calibri;panose-1:2 15 5 2 2 2 4 3 2 4;}
#yiv5886144848
#yiv5886144848 p.yiv5886144848MsoNormal, #yiv5886144848 li.yiv5886144848MsoNormal, #yiv5886144848 div.yiv5886144848MsoNormal
{margin:0in;margin-bottom:.0001pt;font-size:11.0pt;font-family:"Calibri", sans-serif;}
#yiv5886144848 a:link, #yiv5886144848 span.yiv5886144848MsoHyperlink
{color:blue;text-decoration:underline;}
#yiv5886144848 a:visited, #yiv5886144848 span.yiv5886144848MsoHyperlinkFollowed
{color:purple;text-decoration:underline;}
#yiv5886144848 span.yiv5886144848EmailStyle17
{font-family:"Calibri", sans-serif;color:windowtext;}
#yiv5886144848 .yiv5886144848MsoChpDefault
{font-family:"Calibri", sans-serif;}
_filtered #yiv5886144848 {margin:1.0in 1.0in 1.0in 1.0in;}
#yiv5886144848 div.yiv5886144848WordSection1
{}
--></style>
<div>
<div class="yiv5886144848WordSection1">
<p class="yiv5886144848MsoNormal">Has anyone been able to run an MPI job using PMIX and heterogeneous jobs successfully with 19.05 (or even 18.08)? I can run without heterogeneous jobs but get all sorts of errors when I try and split the job up.
</p>
<p class="yiv5886144848MsoNormal">I haven’t used MPI/PMIX much so maybe I’m missing something? Any ideas?
</p>
<p class="yiv5886144848MsoNormal"> </p>
<p class="yiv5886144848MsoNormal"><span lang="ES">[slurm@trek8 mpihello]$ sinfo -V</span></p>
<p class="yiv5886144848MsoNormal"><span lang="ES">slurm 19.05.1</span></p>
<p class="yiv5886144848MsoNormal"><span lang="ES"> </span></p>
<p class="yiv5886144848MsoNormal">[slurm@trek8 mpihello]$ which mpicc</p>
<p class="yiv5886144848MsoNormal">/opt/openmpi/4.0.1/bin/mpicc</p>
<p class="yiv5886144848MsoNormal"></p>
<p class="yiv5886144848MsoNormal"><span lang="ES">[slurm@trek8 mpihello]$ sudo yum list pmix</span></p>
<p class="yiv5886144848MsoNormal">Loaded plugins: langpacks, product-id, search-disabled-repos, subscription-manager</p>
<p class="yiv5886144848MsoNormal">Installed Packages</p>
<p class="yiv5886144848MsoNormal">pmix.x86_64 3.1.2rc1.debug-1.el7 installed
</p>
<p class="yiv5886144848MsoNormal"> </p>
<p class="yiv5886144848MsoNormal"><span lang="ES">[slurm@trek8 mpihello]$ mpicc mpihello.c -o mpihh</span></p>
<p class="yiv5886144848MsoNormal"><span lang="ES"> </span></p>
<p class="yiv5886144848MsoNormal"><span lang="ES">[slurm@trek8 mpihello]$ srun -w trek[8-12] -n5 --mpi=pmix mpihh | sort</span></p>
<p class="yiv5886144848MsoNormal">Hello world, I am 0 of 5 - running on trek8</p>
<p class="yiv5886144848MsoNormal">Hello world, I am 1 of 5 - running on trek9</p>
<p class="yiv5886144848MsoNormal">Hello world, I am 2 of 5 - running on trek10</p>
<p class="yiv5886144848MsoNormal">Hello world, I am 3 of 5 - running on trek11</p>
<p class="yiv5886144848MsoNormal">Hello world, I am 4 of 5 - running on trek12</p>
<p class="yiv5886144848MsoNormal"> </p>
<p class="yiv5886144848MsoNormal">[slurm@trek8 mpihello]$ srun -w trek8 --mpi=pmix : -w trek9 mpihh</p>
<p class="yiv5886144848MsoNormal">srun: job 753 queued and waiting for resources</p>
<p class="yiv5886144848MsoNormal">srun: job 753 has been allocated resources</p>
<p class="yiv5886144848MsoNormal">srun: error: (null) [0] mpi_pmix.c:228 [p_mpi_hook_client_prelaunch] mpi/pmix: ERROR: ot create process mapping</p>
<p class="yiv5886144848MsoNormal">srun: error: Application launch failed: MPI plugin's pre-launch setup failed</p>
<p class="yiv5886144848MsoNormal">srun: Job step aborted: Waiting up to 32 seconds for job step to finish.</p>
<p class="yiv5886144848MsoNormal">srun: error: Timed out waiting for job step to complete</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_utils.c:457 [pmixp_p2p_send] mpi/pmix: ERROR: send ed, rc=2, exceeded the retry limit</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_server.c:1493 [_slurm_send] mpi/pmix: ERROR: Cannotd message to /var/tmp/sgm-slurm/slurmd.spool/stepd.slurm.pmix.753.0, size = 649, hostl</p>
<p class="yiv5886144848MsoNormal">(null)</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_ring.c:738 [pmixp_coll_ring_reset_if_to] mpi/p ERROR: 0x7f4f5c016050: collective timeout seq=0</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll.c:281 [pmixp_coll_log] mpi/pmix: ERROR: Dumpinllective state</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_ring.c:756 [pmixp_coll_ring_log] mpi/pmix: ERR0x7f4f5c016050: COLL_FENCE_RING state seq=0</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_ring.c:758 [pmixp_coll_ring_log] mpi/pmix: ERRmy peerid: 0:trek8</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_ring.c:765 [pmixp_coll_ring_log] mpi/pmix: ERRneighbor id: next 1:trek9, prev 1:trek9</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_ring.c:775 [pmixp_coll_ring_log] mpi/pmix: ERRContext ptr=0x7f4f5c0160d0, #0, in-use=0</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_ring.c:775 [pmixp_coll_ring_log] mpi/pmix: ERRContext ptr=0x7f4f5c016108, #1, in-use=0</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_ring.c:775 [pmixp_coll_ring_log] mpi/pmix: ERRContext ptr=0x7f4f5c016140, #2, in-use=1</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_ring.c:786 [pmixp_coll_ring_log] mpi/pmix: ERR seq=0 contribs: loc=1/prev=0/fwd=1</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_ring.c:788 [pmixp_coll_ring_log] mpi/pmix: ERR neighbor contribs [2]:</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_ring.c:821 [pmixp_coll_ring_log] mpi/pmix: ERR done contrib: -</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_ring.c:823 [pmixp_coll_ring_log] mpi/pmix: ERR wait contrib: trek9</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_ring.c:825 [pmixp_coll_ring_log] mpi/pmix: ERR status=PMIXP_COLL_RING_PROGRESS</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_ring.c:829 [pmixp_coll_ring_log] mpi/pmix: ERR buf (offset/size): 553/1659</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1317 [pmixp_coll_tree_reset_if_to] mpi/: ERROR: 0x7f4f5c01eb90: collective timeout seq=0</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll.c:281 [pmixp_coll_log] mpi/pmix: ERROR: Dumpinllective state</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1336 [pmixp_coll_tree_log] mpi/pmix: ER 0x7f4f5c01eb90: COLL_FENCE_TREE state seq=0 contribs: loc=1/prnt=0/child=0</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1338 [pmixp_coll_tree_log] mpi/pmix: ER my peerid: 0:trek8</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1341 [pmixp_coll_tree_log] mpi/pmix: ER root host: 0:trek8</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1355 [pmixp_coll_tree_log] mpi/pmix: ER child contribs [1]:</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1382 [pmixp_coll_tree_log] mpi/pmix: ER done contrib: -</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1384 [pmixp_coll_tree_log] mpi/pmix: ER wait contrib: trek9</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1391 [pmixp_coll_tree_log] mpi/pmix: ER status: coll=COLL_COLLECT upfw=COLL_SND_NONE dfwd=COLL_SND_NONE</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1393 [pmixp_coll_tree_log] mpi/pmix: ER dfwd status: dfwd_cb_cnt=0, dfwd_cb_wait=0</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1396 [pmixp_coll_tree_log] mpi/pmix: ER bufs (offset/size): upfw 91/16415, dfwd 64/16415</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: trek8 [0] pmixp_dmdx.c:466 [pmixp_dmdx_timeout_cleanup] mpi/pmix: E: timeout: ns=slurm.pmix.753.0, rank=1, host=trek9, ts=1563206701</p>
<p class="yiv5886144848MsoNormal">[trek8:23061] pml_ucx.c:176 Error: Failed to receive UCX worker address: Not found (-</p>
<p class="yiv5886144848MsoNormal">[trek8:23061] pml_ucx.c:447 Error: Failed to resolve UCX endpoint for rank 1</p>
<p class="yiv5886144848MsoNormal">[trek8:23061] *** An error occurred in MPI_Barrier</p>
<p class="yiv5886144848MsoNormal">[trek8:23061] *** reported by process [1543534944,0]</p>
<p class="yiv5886144848MsoNormal">[trek8:23061] *** on communicator MPI_COMM_WORLD</p>
<p class="yiv5886144848MsoNormal">[trek8:23061] *** MPI_ERR_OTHER: known error not in list</p>
<p class="yiv5886144848MsoNormal">[trek8:23061] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,</p>
<p class="yiv5886144848MsoNormal">[trek8:23061] *** and potentially your MPI job)</p>
<p class="yiv5886144848MsoNormal">slurmstepd: error: *** STEP 753.0 ON trek8 CANCELLED AT 2019-07-15T09:05:01 ***</p>
<p class="yiv5886144848MsoNormal">srun: Job step aborted: Waiting up to 32 seconds for job step to finish.</p>
<p class="yiv5886144848MsoNormal">srun: error: trek8: task 0: Exited with exit code 16</p>
<p class="yiv5886144848MsoNormal">[slurm@trek8 mpihello]$</p>