[slurm-users] Running pyMPI on several nodes

John Hearns hearnsj at googlemail.com
Fri Jul 12 15:37:55 UTC 2019


Par, by 'poking around' Chris means using tools such as netstat and lsof.
Also I would look at ps -eaf --forest to make sure there are no 'orphaned'
jobs sitting on that compute node.
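
If lsof is not installed on the node, roughly the same check can be done by
reading /proc directly. This is only a sketch, not anything Slurm ships: the
function names are mine, and searching for "pmix" in the socket path is an
assumption based on the name in the log quoted below.

```python
import os


def parse_unix_sockets(proc_net_unix_text, needle="pmix"):
    """Return (inode, path) pairs for bound Unix sockets whose path contains needle.

    Expects the text of /proc/net/unix; the default needle is an assumption.
    """
    matches = []
    for line in proc_net_unix_text.splitlines()[1:]:  # first line is the header
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # socket has no path, so it cannot be the one we want
        inode, path = fields[6], fields[7]
        if needle in path:
            matches.append((int(inode), path))
    return matches


def pids_holding_inode(inode):
    """Return PIDs that have the socket with this inode open.

    Walks /proc/<pid>/fd; without root you only see your own processes.
    """
    target = "socket:[%d]" % inode
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        fd_dir = os.path.join("/proc", entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    pids.append(int(entry))
                    break
            except OSError:
                continue
    return pids


if __name__ == "__main__":
    with open("/proc/net/unix") as fh:
        for inode, path in parse_unix_sockets(fh.read()):
            print(path, pids_holding_inode(inode))
```

Run it as root on the failing node while no job step is active. Any path it
prints is a socket that is still held open, and the PIDs can be matched
against the ps output.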

Having said that though, I have a dim memory of a classic PBSPro error
message which says something about a network connection,
but really means that you cannot open a remote session on that compute
server.

As an aside, you have checked that your username exists on that compute
server?      getent passwd par
Also that your home directory is mounted - or something substituting for
your home directory?
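
Both checks can also be scripted, for example to run on every node in one go.
A minimal sketch, assuming Python is available on the compute nodes; the
function name check_account is made up for illustration. The pwd module goes
through the same NSS lookup that getent passwd uses.

```python
import os
import pwd


def check_account(username):
    """Report whether the user resolves on this host and whether its home exists."""
    try:
        entry = pwd.getpwnam(username)  # same lookup as `getent passwd <user>`
    except KeyError:
        return {"exists": False, "home": None, "home_present": False}
    return {
        "exists": True,
        "home": entry.pw_dir,
        # False here usually means the home filesystem is not mounted
        "home_present": os.path.isdir(entry.pw_dir),
    }


if __name__ == "__main__":
    print(check_account("par"))
```

Running this through srun on both nodes and comparing the results shows
quickly whether lxclient10 differs from lxclient11.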


On Fri, 12 Jul 2019 at 15:55, Chris Samuel <chris at csamuel.org> wrote:

> On 12/7/19 7:39 am, Pär Lundö wrote:
>
> > Presumably, the first 8 tasks originate from the first node (in this
> > case the lxclient11), and the other node (lxclient10) responds as
> > predicted.
>
> That looks right, it seems the other node has two processes fighting
> over the same socket and that's breaking Slurm there.
>
> > Is it necessary to have passwordless ssh communication alongside the
> > munge authentication?
>
> No, srun doesn't need (or use) that at all.
>
> > In addition I checked the slurmctld-log from both the server and client
> > and found something (noted in bold):
>
> This is from the slurmd log on the client from the look of it.
>
> > *[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using sched affinity
> > for tasks lurm.pmix.83.0: Address already in use[98]*
> > [2019-07-12T14:57:53.682][83.0] error: lxclient[0] /pmix.server.c:386
> > [pmix_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
> > [2019-07-12T14:57:53.683][83.0] error: (null) [0] /mpi_pmix:156
> > [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
>
> That indicates that something else has grabbed the socket it wants and
> that's why the setup of the MPI ranks on the second node fails.
>
> You'll want to poke around there to see what's using it.
>
> Best of luck!
> Chris
> --
>   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
>