[slurm-users] [EXT] Re: pmix issue

Yuengling, Philip J. Philip.Yuengling at jhuapl.edu
Tue Dec 8 02:02:27 UTC 2020


Thanks everyone for your replies!

The cause turned out to be a libevent library dependency that was not being found at run time.  I thought I was using a library from a shared location, but I was not.  I had apparently set up /etc/ld.so.conf.d on the build host to use /usr/local/lib, which has libevent in it, but none of the other nodes had this entry.  The problem became apparent after going to each node and running pmix_info.

This means I should remove the ld.so.conf.d entry and rebuild everything against the preferred set of libraries.  Otherwise pmix appears to work as expected now.
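
For anyone hitting the same symptom, a minimal sketch of one way to spot an unresolved dependency like this is to filter ldd output for "not found" entries. The function name missing_libs and the plugin path in the usage comment are assumptions for illustration, not something taken from this thread.

```shell
# Sketch only: list shared libraries that ldd reports as "not found".
# Reads ldd output on stdin, so it works on output gathered from any node.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Hypothetical usage on a compute node (the plugin path is an assumption
# based on --libdir=/usr/lib64 in the config.log quoted further down):
#   ldd /usr/lib64/slurm/mpi_pmix_v3.so | missing_libs
```

Running the same check on the build host and on a compute node should make a difference in loader configuration, such as an extra ld.so.conf.d entry, visible as a difference in output.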

Cheers!
Phil

From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Philip Kovacs <pkdevel at yahoo.com>
Reply-To: Philip Kovacs <pkdevel at yahoo.com>, Slurm User Community List <slurm-users at lists.schedmd.com>
Date: Monday, December 7, 2020 at 10:55 AM
To: "andy at candooz.com" <andy at candooz.com>, Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] Re: pmix issue

APL external email warning: Verify sender slurm-users-bounces at lists.schedmd.com before clicking links or attachments



Make sure the .so symlink for the pmix lib is available -- not just the versioned .so, e.g. .so.2.   Slurm requires that .so symlink.  Some distros split packages into base/devel, so you may need to install a pmix-devel package, if available, in order to add the .so symlink (which is considered a "development" file).
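
A minimal sketch of that check, assuming the library is named libpmix.so and that the caller knows which directory to look in (the function name check_pmix_so_link is hypothetical):

```shell
# Sketch only: report whether the unversioned libpmix.so link exists in a
# given library directory. The directory is supplied by the caller.
check_pmix_so_link() {
    libdir=$1
    if [ -e "$libdir/libpmix.so" ]; then
        echo "ok: $libdir/libpmix.so"
    else
        echo "missing: $libdir/libpmix.so"
        return 1
    fi
}

# Hypothetical usage (the lib subdirectory is an assumption):
#   check_pmix_so_link /share/local/pmix-3.2.1/lib
```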

On Monday, December 7, 2020, 09:22:06 AM EST, Yuengling, Philip J. <philip.yuengling at jhuapl.edu> wrote:



Thanks Andy,



Slurm was compiled with --with-pmix=/share/local/pmix-3.2.1.  The build of pmix is installed under /share/local/pmix-3.2.1, which is an NFS share across all the nodes.  I should also note I used devtoolset-10 (gcc 10) on RHEL7 and confirmed that everything was compiled with that version of the compiler.



I also set LD_LIBRARY_PATH to include /share/local/pmix-3.2.1
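
One thing worth checking here: the dynamic loader searches the directories listed in LD_LIBRARY_PATH themselves, so an installation prefix on its own normally does not help unless the libraries sit directly in it. The sketch below tests whether a specific directory is an entry in a search path. The function name path_contains is hypothetical, and the lib subdirectory in the usage comment is an assumption based on the usual default for a --prefix install.

```shell
# Sketch only: succeed if directory $2 is an exact entry in the
# colon-separated search path $1.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical usage (the lib subdirectory is an assumption):
#   path_contains "$LD_LIBRARY_PATH" /share/local/pmix-3.2.1/lib || echo "not on search path"
```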



Cheers!

Phil



From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Andy Riebs <andy at candooz.com>
Reply-To: "andy at candooz.com" <andy at candooz.com>, Slurm User Community List <slurm-users at lists.schedmd.com>
Date: Friday, December 4, 2020 at 3:07 PM
To: "slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
Subject: [EXT] Re: [slurm-users] pmix issue







Also, Slurm was built with "/fs/local/pmix-3.2.1" -- does that translate well to "/share/local/pmix-3.2.1"?

Andy
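
A rough way to answer that question on a given node is to resolve both paths and compare the results. This is only a sketch: the function name same_location is hypothetical, it relies on GNU readlink -f, and it only detects symlinked paths, not the same export mounted at two separate mount points.

```shell
# Sketch only: succeed if both paths exist and resolve to the same location.
# Uses GNU readlink -f, which follows symlinks to the canonical path.
same_location() {
    a=$(readlink -f -- "$1") || return 1
    b=$(readlink -f -- "$2") || return 1
    [ -e "$a" ] && [ "$a" = "$b" ]
}

# Hypothetical usage on each node:
#   same_location /fs/local/pmix-3.2.1 /share/local/pmix-3.2.1 || echo "paths differ"
```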

On 12/4/2020 2:59 PM, Andy Riebs wrote:

Are you sure that /share/local/pmix-3.2.1 exists on the compute nodes?

On 12/4/2020 2:54 PM, Yuengling, Philip J. wrote:

Hi everyone,



I’ve been having difficulty getting the --mpi=pmix_v3 option to work for me.  I can get --mpi=pmi2 to work ok, but I really want to understand what I’m doing wrong here.  Everything seems to build ok.



$ srun --mpi=list
srun: MPI types are...
srun: pmix
srun: pmix_v3
srun: cray_shasta
srun: none
srun: pmi2

$ srun --mpi=pmix_v3 -N5 date
srun: error: task 1 launch failed: Invalid MPI plugin name
srun: error: task 2 launch failed: Invalid MPI plugin name
srun: error: task 3 launch failed: Invalid MPI plugin name
srun: error: task 4 launch failed: Invalid MPI plugin name
srun: error: task 0 launch failed: Invalid MPI plugin name

$ srun --mpi=pmi2 -N5 date
Fri Dec  4 13:52:39 EST 2020
Fri Dec  4 13:52:39 EST 2020
Fri Dec  4 13:52:39 EST 2020
Fri Dec  4 13:52:39 EST 2020
Fri Dec  4 13:52:39 EST 2020





openpmix:

CC=/opt/rh/devtoolset-10/root/usr/bin/gcc ./configure --prefix=/share/local/pmix-3.2.1 --with-hwloc=/share/local/hwloc-2.4.0



Slurm 20.11.0:

rpmbuild --define "_with_pmix --with-pmix=/fs/local/pmix-3.2.1" -ta slurm-20.11.0.tar.bz2

From config.log:

./configure --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc/slurm --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-pmix=/fs/local/pmix-3.2.1 --disable-slurmrestd



Open MPI 4.0.5:

./configure  '--prefix=/share/openmpi-4.0.5' '--with-cuda' '--with-pmix=/share/local/pmix-3.2.1' '--with-pmi=/usr' '--with-slurm' '--without-ucx' '--without-verbs'
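
Since the builds above name the PMIx location in more than one way, a small sketch for pulling the --with-pmix value out of a configure line or a config.log may help when comparing them. The function name with_pmix_values is hypothetical.

```shell
# Sketch only: print every --with-pmix=PATH value found in text on stdin.
# Stops at a space or quote so quoted configure arguments also work.
with_pmix_values() {
    grep -o -- "--with-pmix=[^ '\"]*" | cut -d= -f2-
}

# Hypothetical usage:
#   with_pmix_values < config.log | sort -u
```

If the Slurm and Open MPI builds print different paths, it is worth confirming that both resolve to the same PMIx installation on every node.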

--



Philip J. Yuengling

Johns Hopkins University
