[slurm-users] Running pyMPI on several nodes

Pär Lundö par.lundo at foi.se
Fri Jul 12 14:39:25 UTC 2019


Hi,
Thank you for your response.
When I run it ("srun -N2 -n8 hostname") I get an error stating:

"srun: job step 83.0 aborted before step completely launched.
srun: error: task 0 launced failed: Unspecified error.
srun: error: task 1 launced failed: Unspecified error.
srun: error: task 2 Launced failed: Unspecified error.
srun: error: task 3 launced failed: Unspecified error.
srun: error: task 4 launced failed: Unspecified error.
srun: error: task 5 launced failed: Unspecified error.
srun: error: task 6 launced failed: Unspecified error.
 srun: error: task 7 launced failed: Unspecified error.
lxclient10
lxclient10
lxclient10
lxclient10
lxclient10
lxclient10
lxclient10
lxclient10
"
Presumably, the first eight tasks originate from the first node (in this case lxclient11), and the other node (lxclient10) responds as expected.
Is it necessary to have passwordless SSH communication alongside the munge authentication?

In addition, I checked the slurmctld log on both the server and the client and found something (see the error lines below):
"[2019-07-12T14:57:53.543] launch task 83.0 from UID 1000 GID: 1000 HOST:192.168.1.1 PORT:4810
[2019-07-12T14:57:53.544] lllp distribution jobid[83] implicit auto binding: cores.one_thread.dist 8192
[2019-07-12T14:57:53.544] _task_layout_lllp_cyclic
[2019-07-12T14:57:53.544] _lllp_generate_cpu bind jobid [83]: mask_cpu, one_thread, 0x10, 0x01, 0x20, 0x02, 0x40, 0x04, 0x80
[2019-07-12T14:57:53.545] _run_prolog: run job script took usec=11
[2019-07-12T14:57:53.543] _run_prolog: prolog with lock for job 83 ran for 0 seconds
[2019-07-12T14:57:53.771] [83.0] task_p_pre_launch: Using sched_affinity for tasks
[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using sched affinity for tasks lurm.pmix.83.0: Address already in use[98]
[2019-07-12T14:57:53.682][83.0] error: lxclient[0] /pmix.server.c:386 [pmix_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
[2019-07-12T14:57:53.683][83.0] error: (null) [0] /mpi_pmix:156 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
[2019-07-12T14:57:53.686][83.0] error: Failed mpi_hook_slurmstepd_prefork
[2019-07-12T14:57:53.691][83.0] error: job_manage existing abnormally, rc=1
[2019-07-12T14:57:53.772][83.0] task_p_pre_launch: Using sched_affinity for tasks
[2019-07-12T14:57:53.772][83.0] task_p_pre_launch: Using sched_affinity for tasks
[2019-07-12T14:57:53.772][83.0] task_p_pre_launch: Using sched_affinity for tasks
[2019-07-12T14:57:53.775][83.0] task_p_pre_launch: Using sched_affinity for tasks
[2019-07-12T14:57:56.004][83.0] done with job
[2019-07-12T14:57:56.005][83.0] error: Unable to unlink domain socket '/var/spool/slurmd/lxclient10_83.0': No such file or directory
[2019-07-12T14:57:56.019][83.0] done with job
"

Best regards
Palle


________________________________
From: "slurm-users" <slurm-users-bounces at lists.schedmd.com>
Sent: 12 July 2019 08:46
To: "Slurm User Community List" <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Running pyMPI on several nodes

My apologies. You do say that the Python program simply prints the rank, so it is a hello-world program.

On Fri, 12 Jul 2019 at 07:45, John Hearns <hearnsj at googlemail.com> wrote:
Please try something very simple such as a hello world program or
srun -N2 -n8 hostname

What is the error message that you get?

On Fri, 12 Jul 2019 at 07:07, Pär Lundö <par.lundo at foi.se> wrote:

Hi there Slurm-experts!
I am having trouble running a python-mpi program on more than one node. The python-mpi program is very simple: it only lists the ranks available in its environment. I have a munge daemon running prior to starting the Slurm service, and the program works when using a single node (so I suppose munge is working).
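
For reference, a minimal rank-printing program of this kind might look as follows (a sketch only, assuming the mpi4py binding; the actual python-mpi.py is not shown):

    from mpi4py import MPI  # assumption: mpi4py is the MPI binding in use

    comm = MPI.COMM_WORLD
    # Each rank reports its index, the total number of ranks, and its host.
    print("rank %d of %d on %s"
          % (comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))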
In addition, I have tested running a simple sbatch script in which each of the four available nodes prints its hostname and returns (roughly as sketched below).
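
A sketch of such a test script (assumed contents; the original script is not shown):

    #!/bin/bash
    #SBATCH --nodes=4    # one node per available machine
    #SBATCH --ntasks=4   # one task per node
    srun hostname        # each task prints the hostname of its node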
Since Slurm authenticates via munge, do I need passwordless SSH communication between the slurmctld and the nodes? (I found a guide, probably outdated, stating that passwordless SSH communication is a necessity for Slurm: http://admin-magazine.com/HPC/Articles/Resource-Management-with-Slurm.)

I run the python-mpi program via an sbatch script invoking an srun command. Each node has 8 CPUs.
The srun command is:
"srun -N2 -n8 python3 python-mpi.py",
when tested on two nodes.
It works fine running on a single node (with "-N1" instead of "-N2"), but it is aborted or stopped when running on two nodes.
Should I have "-n16" when running on two nodes (in order to allocate the complete number of CPUs available on the two nodes)?
Slurm is configured and built with PMIx.
I am running Slurm 19.05 on Ubuntu 18.04 as the server, and the nodes are running the same Slurm version on Ubuntu 18.10.
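
For reference, the wrapper script described above might look roughly like this (a sketch only: the contents, the task count, and the explicit --mpi=pmix plugin selection are assumptions, not the original script):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks=16                    # 8 CPUs per node x 2 nodes
    srun --mpi=pmix python3 python-mpi.py  # run one rank per allocated task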

Best regards,

Palle

