[slurm-users] PMIX with heterogeneous jobs

Mehlberg, Steve steve.mehlberg at atos.net
Tue Jul 16 23:03:52 UTC 2019


Philip, thanks for trying 18.08.8 for me.  I finally got a system built with 18.08.8, and I’m having much better success running heterogeneous jobs with PMIX.  I haven’t seen the intermittent problem you have, but I’ve just started testing.  I wonder if there is a bug in 19.05.1?

$ sinfo -V
slurm 18.08.8

$ srun -wtrek8 -n2 --mpi=pmix : -wtrek9 -n2 mpihh | sort
srun: job 46073 queued and waiting for resources
srun: job 46073 has been allocated resources
Hello world, I am 0 of 4 - running on trek8
Hello world, I am 1 of 4 - running on trek8
Hello world, I am 2 of 4 - running on trek9
Hello world, I am 3 of 4 - running on trek9
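
For reference, the mpihello.c source isn't included in this thread; a minimal sketch along these lines would produce output in the same "Hello world, I am R of N - running on HOST" format (the MPI_Barrier call is an assumption, based on the MPI_Barrier failure in the 19.05 trace further down):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total tasks across both het job components */
    MPI_Get_processor_name(host, &len);     /* node name, e.g. trek8 */

    printf("Hello world, I am %d of %d - running on %s\n", rank, size, host);

    MPI_Barrier(MPI_COMM_WORLD);            /* assumed - a barrier shows up in the failing trace */
    MPI_Finalize();
    return 0;
}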

From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Philip Kovacs
Sent: Tuesday, July 16, 2019 12:03 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] PMIX with heterogeneous jobs

Well, it looks like it does fail as often as it works.

srun --mpi=pmix -n1 -wporthos : -n1 -wathos ./hello
srun: job 681 queued and waiting for resources
srun: job 681 has been allocated resources
slurmstepd: error: athos [0] pmixp_coll_ring.c:613 [pmixp_coll_ring_check] mpi/pmix: ERROR: 0x153ab0017e00: unexpected contrib from athos:0, expected is 1
slurmstepd: error: athos [0] pmixp_server.c:930 [_process_server_request] mpi/pmix: ERROR: 0x153ab0017e00: unexpected contrib from athos:0, coll->seq=0, seq=0
slurmstepd: error: porthos [1] pmixp_coll_ring.c:613 [pmixp_coll_ring_check] mpi/pmix: ERROR: 0x146fdc016bd0: unexpected contrib from porthos:1, expected is 0
slurmstepd: error: porthos [1] pmixp_server.c:930 [_process_server_request] mpi/pmix: ERROR: 0x146fdc016bd0: unexpected contrib from porthos:1, coll->seq=0, seq=0



On Tuesday, July 16, 2019, 09:49:59 AM EDT, Mehlberg, Steve <steve.mehlberg at atos.net> wrote:



Has anyone been able to run an MPI job using PMIX and heterogeneous jobs successfully with 19.05 (or even 18.08)?  I can run without heterogeneous jobs, but I get all sorts of errors when I try to split the job up.

I haven’t used MPI/PMIX much, so maybe I’m missing something.  Any ideas?
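
If it helps compare the two environments, a quick way to double-check which Open MPI build is actually being picked up on each cluster is a small sketch like this (not part of mpihello.c, just an illustration using the standard MPI-3 MPI_Get_library_version call):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Reports the library version string, e.g. "Open MPI v4.0.1, ..." */
    MPI_Get_library_version(version, &len);
    if (rank == 0)
        printf("MPI library: %s\n", version);

    MPI_Finalize();
    return 0;
}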



[slurm at trek8 mpihello]$ sinfo -V
slurm 19.05.1

[slurm at trek8 mpihello]$ which mpicc
/opt/openmpi/4.0.1/bin/mpicc

[slurm at trek8 mpihello]$ sudo yum list pmix
Loaded plugins: langpacks, product-id, search-disabled-repos, subscription-manager
Installed Packages
pmix.x86_64                       3.1.2rc1.debug-1.el7                       installed

[slurm at trek8 mpihello]$ mpicc mpihello.c -o mpihh

[slurm at trek8 mpihello]$ srun -w trek[8-12] -n5 --mpi=pmix mpihh | sort
Hello world, I am 0 of 5 - running on trek8
Hello world, I am 1 of 5 - running on trek9
Hello world, I am 2 of 5 - running on trek10
Hello world, I am 3 of 5 - running on trek11
Hello world, I am 4 of 5 - running on trek12

[slurm at trek8 mpihello]$ srun -w trek8 --mpi=pmix : -w trek9 mpihh
srun: job 753 queued and waiting for resources
srun: job 753 has been allocated resources
srun: error: (null) [0] mpi_pmix.c:228 [p_mpi_hook_client_prelaunch] mpi/pmix: ERROR: ot create process mapping
srun: error: Application launch failed: MPI plugin's pre-launch setup failed
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
slurmstepd: error: trek8 [0] pmixp_utils.c:457 [pmixp_p2p_send] mpi/pmix: ERROR: send ed, rc=2, exceeded the retry limit
slurmstepd: error: trek8 [0] pmixp_server.c:1493 [_slurm_send] mpi/pmix: ERROR: Cannotd message to /var/tmp/sgm-slurm/slurmd.spool/stepd.slurm.pmix.753.0, size = 649, hostl
(null)
slurmstepd: error: trek8 [0] pmixp_coll_ring.c:738 [pmixp_coll_ring_reset_if_to] mpi/p ERROR: 0x7f4f5c016050: collective timeout seq=0
slurmstepd: error: trek8 [0] pmixp_coll.c:281 [pmixp_coll_log] mpi/pmix: ERROR: Dumpinllective state
slurmstepd: error: trek8 [0] pmixp_coll_ring.c:756 [pmixp_coll_ring_log] mpi/pmix: ERR0x7f4f5c016050: COLL_FENCE_RING state seq=0
slurmstepd: error: trek8 [0] pmixp_coll_ring.c:758 [pmixp_coll_ring_log] mpi/pmix: ERRmy peerid: 0:trek8
slurmstepd: error: trek8 [0] pmixp_coll_ring.c:765 [pmixp_coll_ring_log] mpi/pmix: ERRneighbor id: next 1:trek9, prev 1:trek9
slurmstepd: error: trek8 [0] pmixp_coll_ring.c:775 [pmixp_coll_ring_log] mpi/pmix: ERRContext ptr=0x7f4f5c0160d0, #0, in-use=0
slurmstepd: error: trek8 [0] pmixp_coll_ring.c:775 [pmixp_coll_ring_log] mpi/pmix: ERRContext ptr=0x7f4f5c016108, #1, in-use=0
slurmstepd: error: trek8 [0] pmixp_coll_ring.c:775 [pmixp_coll_ring_log] mpi/pmix: ERRContext ptr=0x7f4f5c016140, #2, in-use=1
slurmstepd: error: trek8 [0] pmixp_coll_ring.c:786 [pmixp_coll_ring_log] mpi/pmix: ERR seq=0 contribs: loc=1/prev=0/fwd=1
slurmstepd: error: trek8 [0] pmixp_coll_ring.c:788 [pmixp_coll_ring_log] mpi/pmix: ERR neighbor contribs [2]:
slurmstepd: error: trek8 [0] pmixp_coll_ring.c:821 [pmixp_coll_ring_log] mpi/pmix: ERR done contrib: -
slurmstepd: error: trek8 [0] pmixp_coll_ring.c:823 [pmixp_coll_ring_log] mpi/pmix: ERR wait contrib: trek9
slurmstepd: error: trek8 [0] pmixp_coll_ring.c:825 [pmixp_coll_ring_log] mpi/pmix: ERR status=PMIXP_COLL_RING_PROGRESS
slurmstepd: error: trek8 [0] pmixp_coll_ring.c:829 [pmixp_coll_ring_log] mpi/pmix: ERR buf (offset/size): 553/1659
slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1317 [pmixp_coll_tree_reset_if_to] mpi/: ERROR: 0x7f4f5c01eb90: collective timeout seq=0
slurmstepd: error: trek8 [0] pmixp_coll.c:281 [pmixp_coll_log] mpi/pmix: ERROR: Dumpinllective state
slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1336 [pmixp_coll_tree_log] mpi/pmix: ER 0x7f4f5c01eb90: COLL_FENCE_TREE state seq=0 contribs: loc=1/prnt=0/child=0
slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1338 [pmixp_coll_tree_log] mpi/pmix: ER my peerid: 0:trek8
slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1341 [pmixp_coll_tree_log] mpi/pmix: ER root host: 0:trek8
slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1355 [pmixp_coll_tree_log] mpi/pmix: ER child contribs [1]:
slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1382 [pmixp_coll_tree_log] mpi/pmix: ER         done contrib: -
slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1384 [pmixp_coll_tree_log] mpi/pmix: ER         wait contrib: trek9
slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1391 [pmixp_coll_tree_log] mpi/pmix: ER status: coll=COLL_COLLECT upfw=COLL_SND_NONE dfwd=COLL_SND_NONE
slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1393 [pmixp_coll_tree_log] mpi/pmix: ER dfwd status: dfwd_cb_cnt=0, dfwd_cb_wait=0
slurmstepd: error: trek8 [0] pmixp_coll_tree.c:1396 [pmixp_coll_tree_log] mpi/pmix: ER bufs (offset/size): upfw 91/16415, dfwd 64/16415
slurmstepd: error: trek8 [0] pmixp_dmdx.c:466 [pmixp_dmdx_timeout_cleanup] mpi/pmix: E: timeout: ns=slurm.pmix.753.0, rank=1, host=trek9, ts=1563206701
[trek8:23061] pml_ucx.c:176  Error: Failed to receive UCX worker address: Not found (-
[trek8:23061] pml_ucx.c:447  Error: Failed to resolve UCX endpoint for rank 1
[trek8:23061] *** An error occurred in MPI_Barrier
[trek8:23061] *** reported by process [1543534944,0]
[trek8:23061] *** on communicator MPI_COMM_WORLD
[trek8:23061] *** MPI_ERR_OTHER: known error not in list
[trek8:23061] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[trek8:23061] ***    and potentially your MPI job)
slurmstepd: error: *** STEP 753.0 ON trek8 CANCELLED AT 2019-07-15T09:05:01 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: trek8: task 0: Exited with exit code 16
[slurm at trek8 mpihello]$