[slurm-users] Running pyMPI on several nodes
Pär Lundö
par.lundo at foi.se
Tue Aug 13 06:49:02 UTC 2019
Hi Benson!
Yeah, it was via an NFS-share.
Best regards,
Pälle
On 2019-08-13 08:30, Benson Muite wrote:
>
> Hi Pälle!
>
> Great. It would be helpful to know how the nodes shared the /etc directory. NFS?
>
> Benson
>
> On 8/13/19 9:25 AM, Pär Lundö wrote:
>>
>> Hi!
>>
>> I have now had the chance to look into this matter more thoroughly,
>> and it seems that the problem was that the nodes are diskless and
>> shared some data (e.g. the "/etc" directory). I removed that
>> dependency and mounted each node on a unique set of directories,
>> which resolved the issue. Presumably this can be done in other ways
>> unknown to me, but it worked for me, and I can now run on multiple nodes via MPI.
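>>
>> For reference, the per-node layout I ended up with looks roughly like
>> this (a sketch of the boot server's /etc/exports; the paths are just
>> examples of how it could be laid out, not my exact configuration):
>>
>>     # one exported root per diskless node instead of a single shared one
>>     /export/lxclient10  lxclient10(rw,no_root_squash,sync)
>>     /export/lxclient11  lxclient11(rw,no_root_squash,sync)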
>>
>> Thank you for your help!
>>
>> Best regards,
>> Pälle L
>>
>> On 2019-07-16 15:49, Benson Muite wrote:
>>>
>>> Hi,
>>>
>>> Does a regular MPI program run on two nodes? For example helloworld:
>>>
>>> https://people.sc.fsu.edu/~jburkardt/c_src/hello_mpi/hello_mpi.c
>>>
>>> https://people.sc.fsu.edu/~jburkardt/py_src/hello_mpi/hello_mpi.py
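>>>
>>> A sketch of how that test might look (assuming mpi4py is installed on
>>> both nodes; the node/task counts and the --mpi flag are guesses to be
>>> adapted to your setup):
>>>
>>>     wget https://people.sc.fsu.edu/~jburkardt/py_src/hello_mpi/hello_mpi.py
>>>     srun -N2 -n4 --mpi=pmix python3 hello_mpi.py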
>>>
>>> Benson
>>>
>>> On 7/16/19 4:30 PM, Pär Lundö wrote:
>>>> Hi,
>>>> Thank you for your quick answer!
>>>> I’ll look into that, but the nodes share the same hosts file and
>>>> the DHCP server sets their hostnames.
>>>>
>>>> However, I came across a setting in the slurm.conf file, ”TmpFS”,
>>>> and there was a note regarding it in the MPI guide on the Slurm
>>>> website. I implemented the proposed changes but still had no luck.
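>>>>
>>>> The change amounts to a single slurm.conf line, something like the
>>>> following (the path here is just an example of a node-local
>>>> directory, not the exact value from the guide):
>>>>
>>>>     TmpFS=/tmp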
>>>>
>>>> Best regards,
>>>> Palle
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* "slurm-users" <slurm-users-bounces at lists.schedmd.com>
>>>> *Sent:* 16 July 2019 12:32
>>>> *To:* "Slurm User Community List" <slurm-users at lists.schedmd.com>
>>>> *Subject:* Re: [slurm-users] Running pyMPI on several nodes
>>>>
>>>> srun: error: Application launch failed: Invalid node name specified
>>>>
>>>> Hearns Law. All batch system problems are DNS problems.
>>>>
>>>> Seriously though - check out your name resolution both on the head
>>>> node and the compute nodes.
>>>>
>>>>
>>>> On Tue, 16 Jul 2019 at 08:49, Pär Lundö <par.lundo at foi.se> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have now had the time to look at some of your suggestions.
>>>>
>>>> First I tried running "srun -N1 hostname" via an sbatch script,
>>>> while having two nodes up and running.
>>>> "sinfo" showed that both nodes were up and idle prior to
>>>> submitting the sbatch script.
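>>>>
>>>> The script itself is minimal, roughly (reconstructed from memory,
>>>> not copied verbatim):
>>>>
>>>>     #!/bin/bash
>>>>     #SBATCH -N1
>>>>     srun hostname
>>>>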
>>>> After submitting the job, I receive an error stating that:
>>>>
>>>> "srun: error: Task launch for 86.0 failed on node lxclient11:
>>>> Invalid node name specified.
>>>> srun: error: Application launch failed: Invalid node name specified
>>>> srun: Job step aborted: Waiting up to 32 seconds for job step
>>>> to finish.
>>>> srun: error: Timed out waiting for job step to complete"
>>>>
>>>>
>>>> From the log file at the client I get a more detailed error:
>>>> " Launching batch job 86 for UID 1000
>>>> [86.batch] error: Invalid host_index -1 for job 86
>>>> [86.batch] error: Host lxclient10 not in hostlist lxclient11
>>>> [86.batch] task_pre_launch: Using sched_affinity for tasks
>>>> rpc_launch_tasks: Invalid node list (lxclient10 not in lxclient11)"
>>>>
>>>> My two nodes are called lxclient10 and lxclient11.
>>>> Why is my batch job launched with UID 1000? Shouldn't it be
>>>> launched via the slurm user (which in my case has UID 64030)?
>>>> What does it mean that a node is not in the node list?
>>>> The two nodes and the server share the same set of
>>>> IP addresses in the "/etc/hosts" file.
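>>>>
>>>> (The entries look something like the following; the addresses and
>>>> the server name here are made up placeholders:
>>>>
>>>>     192.168.1.1   lxserver
>>>>     192.168.1.10  lxclient10
>>>>     192.168.1.11  lxclient11
>>>>
>>>> with "lxserver" standing in for the head node.)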
>>>>
>>>> -> This was resolved: lxclient10 was marked as down. After
>>>> getting it back up, submitting the same sbatch script
>>>> resulted in no error.
>>>> However, running it on two nodes I get an error:
>>>> "srun: error: Job Step 88.0 aborted before step completely
>>>> launched.
>>>> srun: error: Job step aborted: Waiting up to 32 seconds for job
>>>> step to finish.
>>>> srun: error: task 1 launch failed: Unspecified error
>>>> srun: error: lxclient10: task 0: Killed"
>>>>
>>>> And in the slurmctld.log file from the client I get an error
>>>> similar to the one previously stated: pmix cannot bind the
>>>> UNIX socket /var/spool/slurmd/stepd.slurm.pmix.88.0: Address
>>>> already in use (98)
>>>>
>>>> I ran the lsof command, but I don't really know what I am
>>>> looking for. If I grep for the different node names, I can see
>>>> that the two nodes have mounted the NFS partition and that a
>>>> link is established.
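>>>>
>>>> The commands I used were along these lines (reconstructed, not
>>>> exact):
>>>>
>>>>     # which process has the pmix step socket open?
>>>>     lsof /var/spool/slurmd/stepd.slurm.pmix.88.0
>>>>     # any leftover slurmstepd processes from earlier steps?
>>>>     ps -eaf --forest | grep slurmstepd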
>>>>
>>>> "As an aside, you have checked that your username exists on
>>>> that compue server? getent passwd par
>>>> Also that your home directory is mounted - or something
>>>> substituting for your home directory?"
>>>> Yes, the user slurm exists on both nodes and have the same uid.
>>>>
>>>> "Have you tried
>>>>
>>>>
>>>> srun -N# -n# mpirun python3 ....
>>>>
>>>>
>>>> Perhaps you have no MPI environment being set up for the
>>>> processes? There was no "--mpi" flag in your "srun" command,
>>>> and we don't know if you have a default value for that or not.
>>>>
>>>> "
>>>>
>>>> In my slurm.conf file I do specify "MpiDefault=pmix". (And
>>>> it can be seen in the log file that there is something wrong
>>>> with pmix: the address is already in use.)
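>>>>
>>>> So an explicit equivalent of my srun invocation would be something
>>>> like this (the script name is a placeholder):
>>>>
>>>>     srun --mpi=pmix -N2 -n16 python3 my_mpi_script.py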
>>>>
>>>> One thing that struck my mind now is that I run these nodes as
>>>> a pair of diskless nodes, which boot and mount the same
>>>> filesystem supplied by a server. They run different PIDs
>>>> for different processes, which should not affect one another,
>>>> right?
>>>>
>>>>
>>>> Best regards,
>>>>
>>>> Palle
>>>>
>>>> On 2019-07-12 19:34, Pär Lundö wrote:
>>>>
>>>> Hi,
>>>>
>>>> Thank you so much for your quick responses!
>>>> It is much appreciated.
>>>> I don't have access to the cluster until next week, but I’ll
>>>> be sure to follow up on all of your suggestions and get
>>>> back to you next week.
>>>>
>>>> Have a nice weekend!
>>>> Best regards
>>>> Palle
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* "slurm-users" <slurm-users-bounces at lists.schedmd.com>
>>>> *Sent:* 12 July 2019 17:37
>>>> *To:* "Slurm User Community List" <slurm-users at lists.schedmd.com>
>>>> *Subject:* Re: [slurm-users] Running pyMPI on several nodes
>>>>
>>>> Par, by 'poking around' Chris means to use tools such as
>>>> netstat and lsof.
>>>> Also, I would look at "ps -eaf --forest" to make sure there
>>>> are no 'orphaned' jobs sitting on that compute node.
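>>>>
>>>> For example, something along these lines:
>>>>
>>>>     netstat -anp | grep slurm    # sockets held by Slurm daemons
>>>>     lsof -U | grep slurm         # open UNIX-domain sockets
>>>>     ps -eaf --forest             # look for orphaned job steps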
>>>>
>>>> Having said that though, I have a dim memory of a classic
>>>> PBSPro error message which says something about a network
>>>> connection,
>>>> but really means that you cannot open a remote session on
>>>> that compute server.
>>>>
>>>> As an aside, have you checked that your username exists on
>>>> that compute server? getent passwd par
>>>> Also that your home directory is mounted - or something
>>>> substituting for your home directory?
>>>>
>>>>
>>>> On Fri, 12 Jul 2019 at 15:55, Chris Samuel <chris at csamuel.org> wrote:
>>>>
>>>> On 12/7/19 7:39 am, Pär Lundö wrote:
>>>>
>>>> > Presumably, the first 8 tasks originate from the first node
>>>> > (in this case lxclient11), and the other node (lxclient10)
>>>> > responds as predicted.
>>>>
>>>> That looks right, it seems the other node has two
>>>> processes fighting
>>>> over the same socket and that's breaking Slurm there.
>>>>
>>>> > Is it necessary to have passwordless ssh communication
>>>> > alongside the munge authentication?
>>>>
>>>> No, srun doesn't need (or use) that at all.
>>>>
>>>> > In addition, I checked the slurmctld log from both the
>>>> > server and client and found something (noted in bold):
>>>>
>>>> This is from the slurmd log on the client from the look
>>>> of it.
>>>>
>>>> > *[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using
>>>> > sched affinity for tasks lurm.pmix.83.0: Address already in
>>>> > use[98]*
>>>> > [2019-07-12T14:57:53.682][83.0] error: lxclient[0]
>>>> > /pmix.server.c:386 [pmix_stepd_init] mpi/pmix: ERROR:
>>>> > pmixp_usock_create_srv
>>>> > [2019-07-12T14:57:53.683][83.0] error: (null) [0]
>>>> > /mpi_pmix:156 [p_mpi_hook_slurmstepd_prefork] mpi/pmix:
>>>> > ERROR: pmixp_stepd_init() failed
>>>>
>>>> That indicates that something else has grabbed the
>>>> socket it wants and
>>>> that's why the setup of the MPI ranks on the second
>>>> node fails.
>>>>
>>>> You'll want to poke around there to see what's using it.
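>>>>
>>>> (For instance, something like
>>>>
>>>>     ss -xp | grep stepd.slurm.pmix
>>>>
>>>> on the affected node should show which process is still holding
>>>> that socket.)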
>>>>
>>>> Best of luck!
>>>> Chris
>>>> --
>>>> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
>>>>