<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<style type="text/css">html { -webkit-text-size-adjust: none; -webkit-touch-callout: none;}</style>
</head>
<body style="margin-top: 8px;">
<div dir="ltr"><span style="background-color: rgba(255, 255, 255, 0);">Hi, </span>
<div><span style="background-color: rgba(255, 255, 255, 0);">Thank you for your response.</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">When I run it ("srun -N2 -n8 hostname"), I get the following error:</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);"><br>
</span></div>
<div><span style="background-color: rgba(255, 255, 255, 0);">"srun: job step 83.0 aborted before step completely launched.</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">srun: error: task 0 launch failed: Unspecified error</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">srun: error: task 1 launch failed: Unspecified error</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">srun: error: task 2 launch failed: Unspecified error</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">srun: error: task 3 launch failed: Unspecified error</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">srun: error: task 4 launch failed: Unspecified error</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">srun: error: task 5 launch failed: Unspecified error</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">srun: error: task 6 launch failed: Unspecified error<br>
</span></div>
<div><span style="background-color: rgba(255, 255, 255, 0);">srun: error: task 7 launch failed: Unspecified error<br>
</span></div>
<div><span style="background-color: rgba(255, 255, 255, 0);">lxclient10</span> </div>
<div><span style="background-color: rgba(255, 255, 255, 0);">lxclient10</span> </div>
<div><span style="background-color: rgba(255, 255, 255, 0);">lxclient10</span> </div>
<div><span style="background-color: rgba(255, 255, 255, 0);">lxclient10</span> </div>
<div><span style="background-color: rgba(255, 255, 255, 0);">lxclient10</span> </div>
<div><span style="background-color: rgba(255, 255, 255, 0);">lxclient10</span> </div>
<div><span style="background-color: rgba(255, 255, 255, 0);">lxclient10</span> </div>
<div><span style="background-color: rgba(255, 255, 255, 0);">lxclient10</span> </div>
<div><span style="background-color: rgba(255, 255, 255, 0);">"</span> </div>
<div><span style="background-color: rgba(255, 255, 255, 0);">Presumably, the 8 task errors originate from the first node (in this case lxclient11), while the other node (lxclient10) responds as expected.</span>
</div>
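<div><span style="background-color: rgba(255, 255, 255, 0);">The split between failed tasks and printed hostnames in output like the above can be tallied with a short script. This is only a minimal sketch: the function name summarize_srun_output is hypothetical, and the line patterns are assumed from the pasted output rather than taken from the Slurm documentation.</span>
</div>

```python
import re
from collections import Counter

# Matches lines such as "srun: error: task 3 launch failed: Unspecified error".
# The pattern is an assumption based on the pasted output; it tolerates small
# spelling variations in hand-copied text.
TASK_ERROR = re.compile(r"srun: error: task (\d+) laun\w+ failed", re.IGNORECASE)


def summarize_srun_output(text):
    """Return (sorted failed task ids, Counter of hostnames printed by tasks)."""
    failed = set()
    hosts = Counter()
    for raw in text.splitlines():
        line = raw.strip().strip('"')
        if not line:
            continue
        match = TASK_ERROR.search(line)
        if match:
            failed.add(int(match.group(1)))
        elif not line.startswith("srun:"):
            # Anything that is not an srun message is treated as task output,
            # which for "srun hostname" is the node name.
            hosts[line] += 1
    return sorted(failed), hosts


if __name__ == "__main__":
    sample = "\n".join(
        ["srun: job step 83.0 aborted before step completely launched."]
        + [f"srun: error: task {i} launch failed: Unspecified error" for i in range(8)]
        + ["lxclient10"] * 8
    )
    failed, hosts = summarize_srun_output(sample)
    print("failed tasks:", failed)
    print("hostnames seen:", dict(hosts))
```

<div><span style="background-color: rgba(255, 255, 255, 0);">A node that never appears among the hostnames is a candidate for the one where the step failed to launch, but the tally alone does not prove it.</span>
</div>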
<div><span style="background-color: rgba(255, 255, 255, 0);">Is it necessary to have passwordless SSH communication alongside the munge authentication?</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);"><br>
</span></div>
<div><span style="background-color: rgba(255, 255, 255, 0);">In addition, I checked the slurmctld-log on both the server and the client and found the following (the notable line is in bold):</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">"[2019-07-12T14:57:53.543] launch task 83.0 from UID 1000 GID: 1000 HOST:192.168.1.1 PORT:4810</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:53.544] lllp distribution jobid[83] implicit auto binding: cores.one_thread.dist 8192 </span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:53.544] _task_layout_lllp_cyclic</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:53.544] _lllp_generate_cpu bind jobid [83]: mask_cpu, one_thread, 0x10, 0x01, 0x20, 0x02, 0x40, 0x04, 0x80</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:53.545] _run_prolog: run job script took usec=11</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:53.543] _run_prolog: prolog with lock for job 83 ran for 0 seconds</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:53.771] [83.0] task_p_pre_launch: Using sched_affinity for tasks</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);"><b>[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using sched affinity for tasks lurm.pmix.83.0: Address already in use[98]</b>  <br>
</span></div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:53.682][83.0] error: lxclient[0] /pmix.server.c:386 [pmix_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:53.683][83.0] error: (null) [0] /mpi_pmix:156 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:53.686][83.0] error: Failed mpi_hook_slurmstepd_prefork</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:53.691][83.0] error: job_manager exiting abnormally, rc=1 <br>
</span></div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:53.772][83.0] task_p_pre_launch: Using sched_affinity for tasks</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:53.772][83.0] task_p_pre_launch: Using sched_affinity for tasks</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:53.772][83.0] task_p_pre_launch: Using sched_affinity for tasks</span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:53.775][83.0] task_p_pre_launch: Using sched_affinity for tasks   <br>
</span></div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:56.004][83.0] done with job <br>
</span></div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:56.005][83.0] error: Unable to unlink domain socket ´/var/spool/slurmd/lxclient10_83.0´: No such file or directory</span>
</div>
<div>
<div><span style="background-color: rgba(255, 255, 255, 0);">[2019-07-12T14:57:56.019][83.0] done with job <br>
</span></div>
<div><span style="background-color: rgba(255, 255, 255, 0);">"</span> </div>
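<div><span style="background-color: rgba(255, 255, 255, 0);">For reference, errno 98 on Linux is EADDRINUSE, which is what bind() returns when a Unix domain socket path already exists. The following minimal sketch reproduces that error class in isolation. The helper names and the socket path are hypothetical and have nothing to do with the paths Slurm or PMIx use, and the sketch does not show why the path was already taken here.</span>
</div>

```python
import errno
import os
import socket
import tempfile


def bind_unix_socket(path):
    """Bind a listening Unix domain socket at path and return it."""
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        srv.bind(path)
        srv.listen(1)
    except OSError:
        srv.close()
        raise
    return srv


def second_bind_errno():
    """Bind the same path twice and return the errno of the second attempt."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sock")  # hypothetical path, not Slurm's
        first = bind_unix_socket(path)
        try:
            # The socket file created by the first bind is still present,
            # so this second bind is expected to fail.
            bind_unix_socket(path).close()
        except OSError as exc:
            return exc.errno
        finally:
            first.close()
    return None


if __name__ == "__main__":
    code = second_bind_errno()
    print(code, errno.errorcode.get(code))
```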
<div><span style="background-color: rgba(255, 255, 255, 0);"><br>
</span></div>
<div><span style="background-color: rgba(255, 255, 255, 0);">Best regards </span>
</div>
<div><span style="background-color: rgba(255, 255, 255, 0);">Palle</span> </div>
</div>
<div style="color: black; font-family: Calibri, Tahoma, Arial, Helvetica, sans-serif; font-size: 11pt;">
<br>
</div>
<br>
<hr style="color: black; font-family: Calibri, Tahoma, Arial, Helvetica, sans-serif; font-size: 11pt;">
<b style="color: black; font-family: Calibri, Tahoma, Arial, Helvetica, sans-serif; font-size: 11pt;">From:</b>
<font face="Calibri, Tahoma, Arial, Helvetica, sans-serif"><span style="font-size: 11pt;">"slurm-users" &lt;slurm-users-bounces@lists.schedmd.com&gt;</span></font>
<br>
<b style="color: black; font-family: Calibri, Tahoma, Arial, Helvetica, sans-serif; font-size: 11pt;">Sent:</b>
<font face="Calibri, Tahoma, Arial, Helvetica, sans-serif"><span style="font-size: 11pt;">12 July 2019 08:46</span></font>
<br>
<b style="color: black; font-family: Calibri, Tahoma, Arial, Helvetica, sans-serif; font-size: 11pt;">To:</b>
<font face="Calibri, Tahoma, Arial, Helvetica, sans-serif"><span style="font-size: 11pt;">"Slurm User Community List" &lt;slurm-users@lists.schedmd.com&gt;</span></font>
<br>
<b style="color: black; font-family: Calibri, Tahoma, Arial, Helvetica, sans-serif; font-size: 11pt;">Subject:</b>
<font face="Calibri, Tahoma, Arial, Helvetica, sans-serif"><span style="font-size: 11pt;">Re: [slurm-users] Running pyMPI on several nodes</span></font>
<br>
<br>
</div>
<div dir="ltr">My apologies. You did say that the Python program simply prints the rank, so it is a hello world program.
</div>
<br>
<div class="gmail_quote">
<div class="gmail_attr" dir="ltr">On Fri, 12 Jul 2019 at 07:45, John Hearns < <a href="mailto:hearnsj@googlemail.com">
hearnsj@googlemail.com</a>> wrote: <br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid">
<div dir="ltr">
<div dir="ltr">
<div>Please try something very simple such as a hello world program or    </div>
<div><span style="text-align:left;color:black;text-transform:none;text-indent:0px;letter-spacing:normal;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt;font-style:normal;font-variant:normal;font-weight:400;text-decoration:none;word-spacing:0px;display:inline;white-space:normal;float:none;background-color:rgb(255,255,255)">srun
 -N2 -n8 hostname</span> </div>
<div><br>
</div>
<div>What error message do you get? </div>
</div>
</div>
<br>
<div class="gmail_quote">
<div class="gmail_attr" dir="ltr">On Fri, 12 Jul 2019 at 07:07, Pär Lundö < <a href="mailto:par.lundo@foi.se" target="_blank">
par.lundo@foi.se</a>> wrote: <br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid">
<div style="margin-top:8px">
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
<br>
Hi there Slurm-experts! </div>
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
I am having trouble running a python-mpi program across more than one node. The python-mpi program is very simple: it only lists the number of ranks that are available in its environment. I have a munge daemon running prior to starting the Slurm service,
 and the program works when using a single node (so I suppose munge is working). </div>
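<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
A launcher-level variant of such a rank-listing check can be sketched without any MPI library, by reading the per-task variables that srun exports (SLURM_PROCID, SLURM_NTASKS, SLURMD_NODENAME). This is a hedged sketch and not the python-mpi program described above; the function name describe_task and the file name hello_slurm.py are hypothetical. </div>

```python
import os
import socket


def describe_task(env=None):
    """Build a one-line report from the per-task variables that srun exports."""
    env = os.environ if env is None else env
    # Outside of srun these variables are unset, so fall back to placeholders.
    rank = env.get("SLURM_PROCID", "?")
    ntasks = env.get("SLURM_NTASKS", "?")
    node = env.get("SLURMD_NODENAME") or socket.gethostname()
    return f"task {rank} of {ntasks} on {node}"


if __name__ == "__main__":
    print(describe_task())
```

<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
Saved as hello_slurm.py, it could be launched with "srun -N2 -n8 python3 hello_slurm.py". Because it does not initialise MPI, it only checks that Slurm can start tasks on both nodes. </div>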
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
In addition, I have tested running a simple sbatch script in which each available node (four nodes) prints its hostname and returns.
</div>
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
Since Slurm authenticates via munge, do I need passwordless SSH communication between slurmctld and the nodes? (I found a guide, probably outdated, stating that passwordless SSH communication is a necessity for Slurm,
<a href="http://admin-magazine.com/HPC/Articles/Resource-Management-with-Slurm" target="_blank">
HTTP://admin-magazine.com/HPC/Articles/Resource-Management-with-Slurm</a>). </div>
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
<br>
</div>
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
I run the python-mpi program via an sbatch script that invokes an srun command. Each node has 8 CPUs.
</div>
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
The srun command is: </div>
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
”srun -N2 -n8 python3 python-mpi.py”, </div>
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
when tested on two nodes.  </div>
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
It works fine on a single node (with ”-N1” instead of ”-N2”), but it is aborted or stopped when running on two nodes.
</div>
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
Should I use ”-n16” when running on two nodes? (In order to allocate all of the CPUs available on the two nodes.)
</div>
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
Slurm is configured and built with pmix. </div>
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
I am running Slurm 19.05 on Ubuntu 18.04 on the server, and the nodes are running the same Slurm version on Ubuntu 18.10.
</div>
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
<br>
</div>
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
Best regards, </div>
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
<br>
</div>
<div style="color:black;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;font-size:11pt" dir="ltr">
Palle  </div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</body>
</html>