<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hi,</p>
<p>I have now had the time to look at some of your suggestions.</p>
<p>First I tried running "srun -N1 hostname" from an sbatch script
with both nodes up and running. <br>
"sinfo" shows both nodes as up and idle before I submit
the sbatch script.<br>
After submitting the job, I get the following error:</p>
<p>"srun: error: Task launch for 86.0 failed on node lxclient11:
Invalid node name specified.<br>
srun: error: Application launch failed: Invalid node name
specified<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to
finish.<br>
srun: error: Timed out waiting for job step to complete"</p>
<p><br>
</p>
<p>From the log file at the client I get a more detailed error:<br>
" Launching batch job 86 for UID 1000 <br>
[86.batch] error: Invalid host_index -1 for job 86<br>
[86.batch] error: Host lxclient10 not in hostlist lxclient11<br>
[86.batch] task_pre_launch: Using sched_affinity for tasks<br>
rpc_launch_tasks: Invalid node list (lxclient10 not in
lxclient11)"<br>
</p>
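<p>Since the host list error suggests that one node was not available to the
job, a per-node state check before submitting may help. This is a minimal
sketch, assuming that sinfo -N -h -o "%N %T" prints one "name state" pair per
line; the helper names parse_node_states and unavailable_nodes are made up
for illustration.</p>

```python
import subprocess


def parse_node_states(sinfo_output):
    """Map node name -> state from lines such as 'lxclient10 idle'."""
    states = {}
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            name, state = parts
            states[name] = state
    return states


def unavailable_nodes(states, ok_states=("idle", "mixed", "allocated")):
    """Return the sorted names of nodes whose state is not in ok_states."""
    # sinfo can append flags such as '*' (node not responding) to the state,
    # so anything that is not an exact match is reported.
    return sorted(name for name, state in states.items() if state not in ok_states)


def query_sinfo():
    """Run sinfo; this only works on a host with the Slurm client tools."""
    result = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %T"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

<p>On a submit host, unavailable_nodes(parse_node_states(query_sinfo()))
would list any node that is down or drained before the sbatch script is
submitted.</p>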
<p>My two nodes are called lxclient10 and lxclient11.<br>
Why is my batch job launched with UID 1000? Shouldn't it be
launched as the slurm user (which in my case has UID 64030)?
<br>
What does it mean that a node is not in the
host list? <br>
Both nodes and the server have the same IP address entries
in their "/etc/hosts" files.</p>
<p>Update: this turned out to be because lxclient10 was marked as down.
After bringing it back up, submitting the same sbatch script
produced no error.<br>
However, when I run it on two nodes I get this error: <br>
"srun: error: Job Step 88.0 aborted before step completely
launched.<br>
srun: error: Job step aborted: Waiting up to 32 seconds for job
step to finish.<br>
srun: error: task 1 launch failed: Unspecified error<br>
srun: error: lxclient10: task 0: Killed"</p>
<p>In the slurmctld.log file on the client I get an error
similar to the one reported previously: pmix cannot bind the UNIX
socket /var/spool/slurmd/stepd.slurm.pmix.88.0: Address already in
use (98)</p>
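<p>To illustrate what this error means at the socket level, here is a minimal
sketch that binds a UNIX socket twice to the same path. The path is a
hypothetical stand-in in a temporary directory, not the real Slurm spool
directory, and the function names are made up. It also shows that the
leftover socket file alone is enough to make a later bind fail, even after
the first socket has been closed.</p>

```python
import errno
import os
import socket
import tempfile


def try_bind(path):
    """Try to bind a UNIX socket at path; return (socket, None) or (None, errno)."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError as exc:
        sock.close()
        return None, exc.errno
    return sock, None


def demo():
    """Bind the same path three times and return the errno of each attempt."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for /var/spool/slurmd/stepd.slurm.pmix.JOB.STEP
        path = os.path.join(tmp, "stepd.slurm.pmix.demo")
        first, err_first = try_bind(path)
        _, err_second = try_bind(path)   # path is taken while first is open
        first.close()
        _, err_stale = try_bind(path)    # the socket file is still on disk
        return err_first, err_second, err_stale
```

<p>On Linux, errno.EADDRINUSE is 98, which matches the number in the log
message.</p>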
<p>I ran the lsof command, but I don't really know what I should be looking
for. If I grep for the node names, I can see that
both nodes have mounted the NFS partition and that a connection is
established.</p>
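<p>One thing that could be looked for is a UNIX socket bound to the pmix path
from the error message. This is a minimal sketch, assuming the usual Linux
/proc/net/unix layout of eight whitespace-separated columns with the path
last; find_unix_sockets and read_proc_net_unix are hypothetical helper
names.</p>

```python
def find_unix_sockets(proc_net_unix_text, needle):
    """Return (inode, path) pairs for bound sockets whose path contains needle."""
    matches = []
    lines = proc_net_unix_text.splitlines()
    for line in lines[1:]:  # skip the header row
        # Columns: Num RefCount Protocol Flags Type St Inode Path
        fields = line.split(None, 7)
        if len(fields) == 8 and needle in fields[7]:
            matches.append((int(fields[6]), fields[7]))
    return matches


def read_proc_net_unix(path="/proc/net/unix"):
    """Read the kernel's table of UNIX sockets on the local machine."""
    with open(path) as handle:
        return handle.read()
```

<p>Calling find_unix_sockets(read_proc_net_unix(), "slurm.pmix") on a compute
node lists only sockets bound on that machine. An empty result while the
socket file still exists in the spool directory would suggest the file was
created elsewhere or left behind.</p>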
<p>"As an aside, you have checked that your username exists on that
compute server? getent passwd par<br>
Also that your home directory is mounted - or something
substituting for your home directory?"<br>
Yes, the user slurm exists on both nodes and has the same UID.<br>
<br>
"Have you tried
<br>
srun -N# -n# mpirun python3 ....
<br>
Perhaps you have no MPI environment being setup for the processes?
There was no "--mpi" flag in your "srun" command and we don't
know if you have a default value for that or not."
<br>
</p>
<p>In my slurm.conf file I do specify "MpiDefault=pmix". (The
log file also shows that something is wrong with
pmix: the socket address is already in use.)<br>
</p>
<p>One thing that just occurred to me is that I run these nodes as a
pair of diskless nodes, which both boot from and mount the same filesystem,
supplied by a server. They run different PIDs for different
processes, so they should not affect one another, right?</p>
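<p>If both diskless nodes see the same /var/spool/slurmd over NFS, a socket
file created on one node would also be visible on the other, which might
explain the bind error. This is an assumption, not something the logs
confirm. Here is a minimal sketch for checking which mount backs a path,
assuming the standard /proc/mounts format; the helper names are made up.</p>

```python
def mount_for_path(proc_mounts_text, path):
    """Return (mount_point, fs_type) of the longest mount point containing path."""
    best = None
    target = path.rstrip("/") + "/"
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mount_point, fs_type = fields[1], fields[2]
            prefix = mount_point.rstrip("/") + "/"
            # '>=' lets a later mount on the same mount point win,
            # since it shadows the earlier one.
            if target.startswith(prefix):
                if best is None or len(mount_point) >= len(best[0]):
                    best = (mount_point, fs_type)
    return best


def is_network_fs(fs_type):
    """True for the NFS filesystem type names used in /proc/mounts."""
    return fs_type in ("nfs", "nfs4")


def read_proc_mounts(path="/proc/mounts"):
    """Read the mount table of the local machine."""
    with open(path) as handle:
        return handle.read()
```

<p>Running mount_for_path(read_proc_mounts(), "/var/spool/slurmd") on each
node shows whether the spool directory is local to that node or comes from
the shared server.</p>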
<p><br>
</p>
<p>Best regards,</p>
<p>Palle<br>
</p>
<div class="moz-cite-prefix">On 2019-07-12 19:34, Pär Lundö wrote:<br>
</div>
<blockquote type="cite"
cite="mid:2f367747e17c41649410d2b62afa0717@foi.se">
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Hi,
</div>
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;"><br>
</div>
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Thank
you so much for your quick responses! </div>
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">It
is much appreciated. </div>
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">I
don't have access to the cluster until next week, but I’ll be
sure to follow up on all of your suggestions and get back to you
next week.
</div>
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;"><br>
</div>
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Have
a nice weekend! </div>
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Best
regards </div>
<div dir="ltr"
style="font-size:11.0pt;font-family:Calibri,Tahoma,Arial,Helvetica,sans-serif;color:black;">Palle
<br>
<br>
<hr>
<b>From:</b> "slurm-users"
<a class="moz-txt-link-rfc2396E" href="mailto:slurm-users-bounces@lists.schedmd.com"><slurm-users-bounces@lists.schedmd.com></a> <br>
<b>Sent:</b> 12 July 2019 17:37 <br>
<b>To:</b> "Slurm User Community List"
<a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com"><slurm-users@lists.schedmd.com></a> <br>
<b>Subject:</b> Re: [slurm-users] Running pyMPI on several nodes
<br>
<br>
</div>
<div dir="ltr">
<div>Par, by 'poking around' Chris means to use tools such as
netstat and lsof. </div>
<div>Also I would look at ps -eaf --forest to make sure there
are no 'orphaned' jobs sitting on that compute node.
</div>
<div><br>
</div>
<div>Having said that though, I have a dim memory of a classic
PBSPro error message which says something about a network
connection,
</div>
<div>but really means that you cannot open a remote session on
that compute server.
</div>
<div><br>
</div>
<div>As an aside, you have checked that your username exists on
that compute server? getent passwd par
</div>
<div>Also that your home directory is mounted - or something
substituting for your home directory?
</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div class="gmail_attr" dir="ltr">On Fri, 12 Jul 2019 at 15:55,
Chris Samuel < <a href="mailto:chris@csamuel.org"
moz-do-not-send="true">
chris@csamuel.org</a>> wrote: <br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid">On
12/7/19 7:39 am, Pär Lundö wrote: <br>
<br>
> Presumably, the first 8 tasks originate from the first
node (in this <br>
> case the lxclient11), and the other node (lxclient10)
responds as <br>
> predicted. <br>
<br>
That looks right, it seems the other node has two processes
fighting <br>
over the same socket and that's breaking Slurm there. <br>
<br>
> Is it necessary to have passwordless ssh communication
alongside the <br>
> munge authentication? <br>
<br>
No, srun doesn't need (or use) that at all. <br>
<br>
> In addition I checked the slurmctld-log from both the
server and client <br>
> and found something (noted in bold): <br>
<br>
This is from the slurmd log on the client from the look of it.
<br>
<br>
> *[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using
sched affinity <br>
> for tasks lurm.pmix.83.0: Address already in use[98]* <br>
> [2019-07-12T14:57:53.682][83.0] error: lxclient[0]
/pmix.server.c:386 <br>
> [pmix_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
<br>
> [2019-07-12T14:57:53.683][83.0] error: (null) [0]
/mpi_pmix:156 <br>
> [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR:
pmixp_stepd_init() failed <br>
<br>
That indicates that something else has grabbed the socket it
wants and <br>
that's why the setup of the MPI ranks on the second node
fails. <br>
<br>
You'll want to poke around there to see what's using it. <br>
<br>
Best of luck! <br>
Chris <br>
-- <br>
Chris Samuel : <a href="http://www.csamuel.org/"
target="_blank" rel="noreferrer" moz-do-not-send="true">
http://www.csamuel.org/</a> : Berkeley, CA, USA <br>
<br>
</blockquote>
</div>
</blockquote>
<pre class="moz-signature" cols="72">--
Regards, Pär
________________________________
Pär Lundö
Researcher
Avdelningen för Ledningssystem
FOI
Totalförsvarets forskningsinstitut
164 90 Stockholm
Visiting address:
Olau Magnus väg 33, Linköping
Tel: +46 13 37 86 01
Mob: +46 734 447 815
Vxl: +46 13 37 80 00
<a class="moz-txt-link-abbreviated" href="mailto:par.lundo@foi.se">par.lundo@foi.se</a>
<a class="moz-txt-link-abbreviated" href="http://www.foi.se">www.foi.se</a></pre>
</body>
</html>