[slurm-users] Running pyMPI on several nodes

Tue Jul 16 13:49:54 UTC 2019

Hi,

Does a regular MPI program run on two nodes? For example helloworld:

https://people.sc.fsu.edu/~jburkardt/c_src/hello_mpi/hello_mpi.c

https://people.sc.fsu.edu/~jburkardt/py_src/hello_mpi/hello_mpi.py

Benson

On 7/16/19 4:30 PM, Pär Lundö wrote:
> Hi,
> Thank you for your quick answer!
> I’ll look into that, but the nodes share the same hosts file and the
> DHCP server sets their hostnames.
>
> However, I came across the "TmpFS" setting in the slurm.conf file, and
> there was a note about it in the MPI guide on the Slurm web page. I
> implemented the proposed changes, but still no luck.
>
> Best regards,
> Palle
>
> ------------------------------------------------------------------------
> *From:* "slurm-users" <slurm-users-bounces at lists.schedmd.com>
> *Sent:* 16 July 2019 12:32
> *To:* "Slurm User Community List" <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] Running pyMPI on several nodes
>
> srun: error: Application launch failed: Invalid node name specified
>
> Hearns Law. All batch system problems are DNS problems.
>
> Seriously though - check out your name resolution both on the head 
> node and the compute nodes.
>
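A quick way to act on the name-resolution advice above is to resolve every node name from every machine and compare the answers. The following is only a minimal sketch: the function name `check_resolution` is hypothetical, and the node names are the ones mentioned in this thread.

```python
import socket

def check_resolution(hostnames, resolver=socket.gethostbyname):
    """Return {hostname: IP address, or None if the name does not resolve}."""
    results = {}
    for name in hostnames:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror and socket.herror are both subclasses of OSError
            results[name] = None
    return results
```

Running `check_resolution(["lxclient10", "lxclient11"])` on the head node and on each compute node should give the same addresses everywhere; a `None` or a mismatch points at the hosts file or DNS on that machine.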
>
> On Tue, 16 Jul 2019 at 08:49, Pär Lundö <par.lundo at foi.se> wrote:
>
>     Hi,
>
>     I have now had the time to look at some of your suggestions.
>
>     First I tried running "srun -N1 hostname" via an sbatch script,
>     while having two nodes up and running.
>     "sinfo" shows that two nodes are up and idle prior to submitting
>     the sbatch script.
>     After submitting the job, I receive the following error:
>
>     "srun: error: Task launch for 86.0 failed on node lxclient11:
>     Invalid node name specified.
>     srun: error: Application launch failed: Invalid node name specified
>     srun: Job step aborted: Waiting up to 32 seconds for job step to
>     finish.
>     srun: error: Timed out waiting for job step to complete"
>
>
>     From the log file at the client I get a more detailed error:
>     " Launching batch job 86 for UID 1000
>     [86.batch] error: Invalid host_index -1 for job 86
>     [86.batch] error: Host lxclient10 not in hostlist lxclient11
>     [86.batch] task_pre_launch: Using sched_affinity for tasks
>     rpc_launch_tasks: Invalid node list (lxclient10 not in lxclient11)"
>
>     My two nodes are called lxclient10 and lxclient11.
>     Why is my batch job launched with UID 1000? Shouldn't it be
>     launched as the slurm user (which in my case has UID 64030)?
>     What does it mean that a node is not in the hostlist?
>     The two nodes and the server share the same IP-address setup
>     in the "/etc/hosts" file.
>
>     -> This was resolved: lxclient10 had been marked as down.
>     After bringing it back up, submitting the same sbatch script
>     resulted in no error.
>     However, when running it on two nodes I get an error:
>     "srun: error: Job Step 88.0 aborted before step completely launched.
>     srun: error: Job step aborted: Waiting up to 32 seconds for job
>     step to finish.
>     srun: error: task 1 launch failed: Unspecified error
>     srun: error: lxclient10: task 0: Killed"
>
>     And in the slurmctld.log file from the client I get an error
>     similar to the one previously stated: pmix cannot bind UNIX
>     socket /var/spool/slurmd/stepd.slurm.pmix.88.0: Address already in
>     use (98)
>
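For readers unfamiliar with this error: binding a UNIX socket fails with "Address already in use" whenever the socket file already exists at that path, even if no process is listening on it any more. The sketch below only illustrates that behaviour with a temporary file; it is not a diagnosis of this cluster, and the function name `bind_over_existing_path` is hypothetical.

```python
import errno
import os
import socket
import tempfile

def bind_over_existing_path():
    """Bind a UNIX socket, close it, then bind a second socket to the same path.

    Returns the errno raised by the second bind, or None if it succeeded.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sock")
        first = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        first.bind(path)   # creates the socket file on disk
        first.close()      # closing does NOT remove the file
        second = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            second.bind(path)  # the path still exists, so this fails
        except OSError as exc:
            return exc.errno
        finally:
            second.close()
        return None
```

The returned value equals `errno.EADDRINUSE`, which is 98 on Linux and matches the "(98)" in the log line above. So the message can come either from a live process holding the socket or from a leftover socket file at that path.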
>     I ran the lsof command, but I don't really know what I am looking
>     for. If I grep for the different node names, I can see that the
>     two nodes have mounted the NFS partition and that a connection is
>     established.
>
>     "As an aside, you have checked that your username exists on that
>     compute server?      getent passwd par
>     Also that your home directory is mounted - or something
>     substituting for your home directory?"
>     Yes, the user slurm exists on both nodes and has the same UID.
>
>     "Have you tried
>
>
>             srun -N# -n# mpirun python3 ....
>
>
>     Perhaps you have no MPI environment being set up for the processes?
>     There was no "--mpi" flag in your "srun" command and we don't
>     know if you have a default value for that or not.
>
>     "
>
>     In my slurm.conf file I do specify "MpiDefault=pmix" (and the
>     log file shows that something is wrong with pmix: the address
>     is already in use).
>
>     One thing that struck me just now is that I run these nodes as a
>     pair of diskless nodes, which boot and mount the same filesystem
>     supplied by a server. They run different PIDs for different
>     processes, which should not affect one another(?), right?
>
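One thing worth checking on diskless nodes is which filesystem actually holds the directory where the socket is created (/var/spool/slurmd in the log above). If both nodes see the same directory over the network, they would also see each other's socket files. Whether that is what happens here is an assumption, not something the thread confirms. A minimal sketch that looks a path up in /proc/mounts-style text follows; the function name `find_mount` is hypothetical, and escaped spaces in mount points are not handled.

```python
import os

def find_mount(path, mounts_text):
    """Return (mount_point, fs_type) of the mount that contains `path`.

    `mounts_text` uses the /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        if mount_point == "/":
            contains = True
        else:
            base = mount_point.rstrip("/")
            contains = path == base or path.startswith(base + "/")
        # Longest mount point wins; on a tie, the later (overlaying) entry wins.
        if contains and (best is None or len(mount_point) >= len(best[0])):
            best = (mount_point, fs_type)
    return best
```

On a node this could be called as `find_mount("/var/spool/slurmd", open("/proc/mounts").read())`. A network filesystem type such as `nfs` or `nfs4` in the result would mean the spool directory is not private to the node.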
>
>     Best regards,
>
>     Palle
>
>     On 2019-07-12 19:34, Pär Lundö wrote:
>
>         Hi,
>
>         Thank you so much for your quick responses!
>         It is much appreciated.
>         I don't have access to the cluster until next week, but I’ll be
>         sure to follow up on all of your suggestions and get back to you
>         next week.
>
>         Have a nice weekend!
>         Best regards
>         Palle
>
>         ------------------------------------------------------------------------
>         *From:* "slurm-users" <slurm-users-bounces at lists.schedmd.com>
>         *Sent:* 12 July 2019 17:37
>         *To:* "Slurm User Community List" <slurm-users at lists.schedmd.com>
>         *Subject:* Re: [slurm-users] Running pyMPI on several nodes
>
>         Par, by 'poking around' Chris means to use tools such as
>         netstat and lsof.
>         Also I would look at ps -eaf --forest to make sure there are
>         no 'orphaned' jobs sitting on that compute node.
>
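As one concrete form of "poking around", the kernel lists bound UNIX sockets in /proc/net/unix, which is the information that tools such as lsof and netstat present. The sketch below filters that text for a path fragment such as "pmix". It assumes the column layout documented in proc(5) (Num, RefCount, Protocol, Flags, Type, St, Inode, Path); the function name `unix_sockets_matching` is hypothetical.

```python
def unix_sockets_matching(proc_net_unix_text, needle):
    """Return (inode, path) pairs for UNIX sockets whose path contains `needle`."""
    matches = []
    for line in proc_net_unix_text.splitlines()[1:]:  # first line is the header
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # unnamed socket: the Path column is absent
        inode, path = fields[6], fields[7].strip()
        if needle in path:
            matches.append((int(inode), path))
    return matches
```

On the compute node this could be called as `unix_sockets_matching(open("/proc/net/unix").read(), "pmix")`. An empty result while the socket file still exists on disk would suggest a leftover file rather than a live process.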
>         Having said that though, I have a dim memory of a classic
>         PBSPro error message which says something about a network
>         connection,
>         but really means that you cannot open a remote session on that
>         compute server.
>
>         As an aside, you have checked that your username exists on
>         that compute server?      getent passwd par
>         Also that your home directory is mounted - or something
>         substituting for your home directory?
>
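The account and home-directory check above can also be scripted so that it runs identically on every node. This is a rough equivalent of `getent passwd` using Python's standard `pwd` module; the function name `check_account` is hypothetical.

```python
import os
import pwd

def check_account(username):
    """Return uid, home directory and whether it exists, or None if the user is unknown."""
    try:
        entry = pwd.getpwnam(username)
    except KeyError:
        return None  # no such user on this node
    return {
        "uid": entry.pw_uid,
        "home": entry.pw_dir,
        "home_exists": os.path.isdir(entry.pw_dir),
    }
```

Calling `check_account("par")` on each node and comparing the results shows at a glance whether the UID matches and whether the home directory is mounted.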
>
>         On Fri, 12 Jul 2019 at 15:55, Chris Samuel <chris at csamuel.org> wrote:
>
>             On 12/7/19 7:39 am, Pär Lundö wrote:
>
>             > Presumably, the first 8 tasks originate from the first node (in this
>             > case the lxclient11), and the other node (lxclient10) responds as
>             > predicted.
>
>             That looks right, it seems the other node has two
>             processes fighting
>             over the same socket and that's breaking Slurm there.
>
>             > Is it necessary to have passwordless ssh communication alongside the
>             > munge authentication?
>
>             No, srun doesn't need (or use) that at all.
>
>             > In addition I checked the slurmctld-log from both the server and client
>             > and found something (noted in bold):
>
>             This is from the slurmd log on the client from the look of
>             it.
>
>             > *[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using sched affinity
>             > for tasks lurm.pmix.83.0: Address already in use[98]*
>             > [2019-07-12T14:57:53.682][83.0] error: lxclient[0] /pmix.server.c:386
>             > [pmix_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
>             > [2019-07-12T14:57:53.683][83.0] error: (null) [0] /mpi_pmix:156
>             > [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
>
>             That indicates that something else has grabbed the socket
>             it wants and
>             that's why the setup of the MPI ranks on the second node
>             fails.
>
>             You'll want to poke around there to see what's using it.
>
>             Best of luck!
>             Chris
>             -- 
>               Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
>     -- 
>     Regards, Pär
>     ________________________________
>     Pär Lundö
>     Researcher
>     Avdelningen för Ledningssystem
>
>     FOI
>     Totalförsvarets forskningsinstitut
>     164 90 Stockholm
>
>     Visiting address:
>     Olau Magnus väg 33, Linköping
>
>
>     Tel: +46 13 37 86 01
>     Mob: +46 734 447 815
>     Switchboard: +46 13 37 80 00
>     par.lundo at foi.se
>     www.foi.se
>