[slurm-users] Running pyMPI on several nodes

Pär Lundö par.lundo at foi.se
Fri Jul 12 17:34:41 UTC 2019


Hi,

Thank you so much for your quick responses!
It is much appreciated.
I don't have access to the cluster until next week, but I'll be sure to follow up on all of your suggestions and get back to you next week.

Have a nice weekend!
Best regards
Palle

________________________________
From: "slurm-users" <slurm-users-bounces at lists.schedmd.com>
Sent: 12 July 2019 17:37
To: "Slurm User Community List" <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Running pyMPI on several nodes

Pär, by 'poking around' Chris means using tools such as netstat and lsof.
I would also look at ps -eaf --forest to make sure there are no 'orphaned' jobs sitting on that compute node.
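
For example, something along these lines on the affected compute node (a rough sketch only; exact flags and tool availability may vary by distribution):

  # list listening sockets and the processes holding them
  sudo netstat -tulpn          # or: sudo ss -tulpn

  # show sockets/files held by Slurm-related processes
  sudo lsof -i -P | grep -i slurm

  # look for orphaned job steps left behind on the node
  ps -eaf --forest | grep -i slurm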

Having said that, I have a dim memory of a classic PBS Pro error message that says something about a network connection,
but really means that you cannot open a remote session on that compute server.

As an aside, have you checked that your username exists on that compute server?      getent passwd par
Also, is your home directory mounted, or something substituting for your home directory?
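
Roughly, on the compute node (assuming 'par' is your username, as above):

  # does the account resolve on this node?
  getent passwd par

  # is the home directory there (or something standing in for it)?
  ls -ld ~par
  df -h ~par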


On Fri, 12 Jul 2019 at 15:55, Chris Samuel <chris at csamuel.org> wrote:
On 12/7/19 7:39 am, Pär Lundö wrote:

> Presumably, the first 8 tasks originate from the first node (in this
> case lxclient11), and the other node (lxclient10) responds as
> predicted.

That looks right, it seems the other node has two processes fighting
over the same socket and that's breaking Slurm there.

> Is it necessary to have passwordless SSH communication alongside the
> munge authentication?

No, srun doesn't need (or use) that at all.

> In addition, I checked the slurmctld log from both the server and the client
> and found something (noted in bold):

This is from the slurmd log on the client from the look of it.

> *[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using sched affinity
> for tasks lurm.pmix.83.0: Address already in use[98]*
> [2019-07-12T14:57:53.682][83.0] error: lxclient[0] /pmix.server.c:386
> [pmix_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
> [2019-07-12T14:57:53.683][83.0] error: (null) [0] /mpi_pmix:156
> [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed

That indicates that something else has grabbed the socket it wants and
that's why the setup of the MPI ranks on the second node fails.

You'll want to poke around there to see what's using it.
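
For example (a sketch only; the spool path below assumes the default SlurmdSpoolDir of /var/spool/slurmd, so adjust it to match your slurm.conf):

  # list UNIX-domain sockets and the processes that own them
  sudo ss -xlp | grep -i pmix

  # or see what is holding files under the slurmd spool directory
  sudo lsof +D /var/spool/slurmd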

Best of luck!
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
