[slurm-users] Running pyMPI on several nodes
chris at csamuel.org
Fri Jul 12 14:53:37 UTC 2019
On 12/7/19 7:39 am, Pär Lundö wrote:
> Presumably, the first 8 tasks originates from the first node (in this
> case the lxclient11), and the other node (lxclient10) response as
That looks right; it seems the other node has two processes fighting
over the same socket, and that's what is breaking Slurm there.
> Is it neccessary to have passwordless ssh communication alongside the
> munge authentication?
No, srun doesn't need (or use) that at all.
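Since srun relies on munge rather than ssh, a quick sanity check is to confirm munge itself works on each node. A minimal sketch (run it on both lxclient10 and lxclient11; the `command -v` guard is just there so it degrades gracefully on hosts without munge):

```shell
#!/bin/sh
# Check that munge can create and decode a credential on this host.
# srun authenticates via munge, so passwordless ssh is not required.
if command -v munge >/dev/null 2>&1; then
  # Generate a credential and immediately decode it locally.
  munge -n | unmunge
else
  echo "munge not installed on this host"
fi
```

If munged is running correctly, `unmunge` reports `STATUS: Success`; an error here would point at an authentication problem rather than anything MPI-specific.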
> In addition I checked the slurmctld-log from both the server and client
> and found something (noted in bold):
From the look of it, this is from the slurmd log on the client.
> *[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using sched affinity
> for tasks lurm.pmix.83.0: Address already in use*
> [2019-07-12T14:57:53.682][83.0] error: lxclient /pmix.server.c:386
> [pmix_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
> [2019-07-12T14:57:53.683][83.0] error: (null)  /mpi_pmix:156
> [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
That indicates that something else has already grabbed the socket PMIx
wants, which is why setting up the MPI ranks on the second node fails.
You'll want to poke around on that node to see what's using it.
Best of luck!
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA