[slurm-users] Running pyMPI on several nodes
hearnsj at googlemail.com
Fri Jul 12 15:37:55 UTC 2019
Pär, by 'poking around' Chris means using tools such as netstat and lsof.
I would also look at ps -eaf --forest to make sure there are no 'orphaned'
jobs sitting on that compute node.
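A rough sketch of those checks, meant to be run on the affected compute node. The grep pattern "pmix" and the function name inspect_sockets are my own assumptions, picked because the slurmd log complains about a pmix socket; change the pattern to whatever name your log shows. netstat and lsof generally need root to show processes owned by other users.

```shell
#!/bin/sh
# Sketch: list what is holding PMIx-related UNIX sockets on a compute node.
# PATTERN is an assumption; adjust it to match the socket name in your slurmd log.
PATTERN="${1:-pmix}"

inspect_sockets() {
    pattern="$1"
    # Listening UNIX domain sockets, with the owning PID/program name.
    if command -v netstat >/dev/null 2>&1; then
        netstat -xlp 2>/dev/null | grep -i -- "$pattern"
    fi
    # Open UNIX domain socket files and the processes holding them.
    if command -v lsof >/dev/null 2>&1; then
        lsof -U 2>/dev/null | grep -i -- "$pattern"
    fi
    # Process tree, to spot leftover slurmstepd or MPI processes.
    ps -eaf --forest 2>/dev/null | grep -i -E -- "slurmstepd|$pattern" | grep -v grep
    # No match is not an error here; it just means nothing was found.
    return 0
}

inspect_sockets "$PATTERN"
```

If any of these print a process that does not belong to a running job, that is a likely candidate for whatever is holding the socket.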
Having said that though, I have a dim memory of a classic PBSPro error
message which says something about a network connection,
but really means that you cannot open a remote session on that compute server.
As an aside, have you checked that your username exists on that compute
server? getent passwd par
Also check that your home directory is mounted there, or whatever is
substituting for your home directory.
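Both checks can be rolled into one small script to run on the compute node. This is only a sketch: the function name check_account is made up for illustration, and it assumes the home directory is field 6 of the passwd entry, which is the standard layout.

```shell
#!/bin/sh
# Sketch: verify that an account resolves and its home directory exists on this node.
check_account() {
    user="$1"
    # getent consults whatever sources nsswitch.conf lists (files, LDAP, ...).
    entry=$(getent passwd "$user") || { echo "user '$user' not found"; return 1; }
    # Field 6 of a passwd entry is the home directory.
    home=$(printf '%s\n' "$entry" | cut -d: -f6)
    if [ -d "$home" ]; then
        echo "ok: $user has home directory $home"
    else
        echo "home directory '$home' is missing for $user"
        return 2
    fi
}

# Default to the username from this thread; pass another name as the first argument.
check_account "${1:-par}" || echo "check failed with status $?"
```

A "not found" result means the node cannot resolve the user at all, while a missing home directory points at a mount problem.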
On Fri, 12 Jul 2019 at 15:55, Chris Samuel <chris at csamuel.org> wrote:
> On 12/7/19 7:39 am, Pär Lundö wrote:
> > Presumably, the first 8 tasks originate from the first node (in this
> > case lxclient11), and the other node (lxclient10) responds as
> > predicted.
> That looks right, it seems the other node has two processes fighting
> over the same socket and that's breaking Slurm there.
> > Is it necessary to have passwordless ssh communication alongside the
> > munge authentication?
> No, srun doesn't need (or use) that at all.
> > In addition I checked the slurmctld-log from both the server and client
> > and found something (noted in bold):
> This is from the slurmd log on the client from the look of it.
> > *[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using sched affinity
> > for tasks lurm.pmix.83.0: Address already in use*
> > [2019-07-12T14:57:53.682][83.0] error: lxclient /pmix.server.c:386
> > [pmix_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
> > [2019-07-12T14:57:53.683][83.0] error: (null)  /mpi_pmix:156
> > [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init()
> That indicates that something else has grabbed the socket it wants and
> that's why the setup of the MPI ranks on the second node fails.
> You'll want to poke around there to see what's using it.
> Best of luck!
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA