[slurm-users] Running pyMPI on several nodes

Chris Samuel chris at csamuel.org
Fri Jul 12 14:53:37 UTC 2019


On 12/7/19 7:39 am, Pär Lundö wrote:

> Presumably, the first 8 tasks originate from the first node (in this 
> case lxclient11), and the other node (lxclient10) responds as 
> predicted.

That looks right; it seems the other node has two processes fighting 
over the same socket, and that's what is breaking Slurm there.

> Is it necessary to have passwordless ssh communication alongside the 
> munge authentication?

No, srun doesn't need (or use) that at all.

> In addition I checked the slurmctld-log from both the server and client 
> and found something (noted in bold):

This is from the slurmd log on the client, by the look of it.

> *[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using sched affinity 
> for tasks lurm.pmix.83.0: Address already in use[98]*
> [2019-07-12T14:57:53.682][83.0] error: lxclient[0] /pmix.server.c:386 
> [pmix_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
> [2019-07-12T14:57:53.683][83.0] error: (null) [0] /mpi_pmix:156 
> [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed

That indicates that something else has grabbed the socket it wants, 
which is why setting up the MPI ranks on the second node fails.
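For what it's worth, errno 98 there is EADDRINUSE, and for the 
Unix-domain socket that pmixp_usock_create_srv() sets up you'd also see 
that if a stale socket file from an earlier step is still lying around 
in the slurmd spool directory. A minimal sketch of that failure mode 
(the path below is made up for illustration, not the real spool 
location):

    # Hypothetical illustration: binding an AF_UNIX socket to a path that
    # already exists fails with errno 98, whether the leftover file
    # belongs to a live process or a dead one.
    import errno
    import os
    import socket

    path = "/tmp/example.pmix.sock"   # made-up path, not the real spool dir

    # First bind succeeds and leaves a socket file on disk.
    s1 = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s1.bind(path)

    # A second bind to the same path fails just like the slurmd log above.
    s2 = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s2.bind(path)
    except OSError as e:
        assert e.errno == errno.EADDRINUSE   # "Address already in use" [98]
        print(f"bind failed: {e}")
    finally:
        s1.close()
        s2.close()
        os.unlink(path)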

You'll want to poke around there to see what's using it.
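Something like "ss -xp" (or lsof) on that node is the quick way to see 
which process has it. If you want to do it by hand, a rough sketch along 
these lines works too, assuming Linux and run as root so other 
processes' fd tables are readable (the socket path argument is whatever 
you spot in the slurmd spool directory / log):

    # Rough sketch: find which PIDs hold a given Unix-domain socket open,
    # by matching its inode from /proc/net/unix against /proc/<pid>/fd.
    import os
    import sys

    def socket_inodes(path):
        """Inodes of Unix sockets bound to `path`, from /proc/net/unix."""
        inodes = []
        with open("/proc/net/unix") as f:
            next(f)  # skip the header line
            for line in f:
                fields = line.split()
                if len(fields) >= 8 and fields[7] == path:
                    inodes.append(fields[6])
        return inodes

    def pids_holding(inodes):
        """PIDs whose fd table contains socket:[inode] for any matched inode."""
        targets = {f"socket:[{i}]" for i in inodes}
        pids = set()
        for pid in filter(str.isdigit, os.listdir("/proc")):
            fd_dir = f"/proc/{pid}/fd"
            try:
                for fd in os.listdir(fd_dir):
                    if os.readlink(os.path.join(fd_dir, fd)) in targets:
                        pids.add(int(pid))
            except OSError:
                continue  # process went away or permission denied
        return sorted(pids)

    if __name__ == "__main__":
        sock_path = sys.argv[1]   # e.g. the pmix socket path from the log
        inodes = socket_inodes(sock_path)
        print("inodes:", inodes, "pids:", pids_holding(inodes))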

Best of luck!
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


