[slurm-users] Running pyMPI on several nodes

John Hearns hearnsj at googlemail.com
Tue Jul 16 10:32:47 UTC 2019


srun: error: Application launch failed: Invalid node name specified

Hearns' Law: all batch system problems are DNS problems.

Seriously though - check your name resolution on both the head node and
the compute nodes.
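A quick way to do that check (a sketch using getent, with the node names taken from the thread below):

```shell
# Run this on the head node and on each compute node; the answers must
# agree with /etc/hosts on every machine (node names from this thread):
for h in lxclient10 lxclient11; do
    getent hosts "$h" || echo "$h: does not resolve here"
done
```

If a node resolves differently on different machines, srun's "Invalid node name" errors usually follow.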


On Tue, 16 Jul 2019 at 08:49, Pär Lundö <par.lundo at foi.se> wrote:

> Hi,
>
> I have now had the time to look at some of your suggestions.
>
> First I tried running "srun -N1 hostname" via an sbatch script, while
> having two nodes up and running.
> "sinfo" yields that two nodes are up and idle prior to submitting the
> sbatch-script.
> After submitting the job, I receive an error stating that:
>
> "srun: error: Task launch for 86.0 failed on node lxclient11: Invalid node
> name specified.
> srun: error: Application launch failed: Invalid node name specified
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete"
>
>
> From the log file at the client I get a more detailed error:
> " Launching batch job 86 for UID 1000
> [86.batch] error: Invalid host_index -1 for job 86
> [86.batch] error: Host lxclient10 not in hostlist lxclient11
> [86.batch] task_pre_launch: Using sched_affinity for tasks
> rpc_launch_tasks: Invalid node list (lxclient10 not in lxclient11)"
>
> My two nodes are called lxclient10 and lxclient11.
> Why is my batch job launched with UID 1000? Shouldn't it be launched
> as the slurm user (which in my case has UID 64030)?
> What is meant by the different nodes not being in the node list?
> The two nodes and the server share the same set of IP addresses in the
> "/etc/hosts" file.
>
> -> This was resolved: lxclient10 was marked as down. After bringing it
> back up, submitting the same sbatch script produced no error.
> However, running it on two nodes I get an error:
> "srun: error: Job Step 88.0 aborted before step completely launched.
> srun: error: Job step aborted: Waiting up to 32 seconds for job step to
> finish.
> srun: error: task 1 launch failed: Unspecified error
> srun: error: lxclient10: task 0: Killed"
>
> And in the slurmctld.log file from the client I get an error similar to
> the one previously stated, that pmix cannot bind UNIX socket
> /var/spool/slurmd/stepd.slurm.pmix.88.0: Address already in use (98)
>
> I ran the lsof command, but I don't really know what I am looking for.
> If I grep for the different node names, I can see that the two nodes
> have mounted the NFS partition and that a link is established.
>
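As for what to look for with lsof: the error names the exact socket path, so lsof can be asked about that path directly (a sketch; the job id 88 comes from the error above and will differ per job):

```shell
# Ask which process, if any, holds the pmix step socket from the error
# message (path and job id taken from this thread; adjust per job):
SOCK=/var/spool/slurmd/stepd.slurm.pmix.88.0
lsof "$SOCK" 2>/dev/null || echo "nothing currently holds $SOCK"
```

A leftover slurmstepd from an earlier step is the usual suspect; its PID appears in the lsof output and can then be inspected with ps.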
> "As an aside, have you checked that your username exists on that compute
> server?      getent passwd par
> Also that your home directory is mounted - or something substituting for
> your home directory?"
> Yes, the user slurm exists on both nodes and has the same UID.
>
> "Have you tried
>
>
>         srun -N# -n# mpirun python3 ....
>
>
> Perhaps you have no MPI environment being setup for the processes?  There
> was no "--mpi" flag in your "srun" command and we don't know if you have a
> default value for that or not.
> "
>
> In my slurm.conf file I do specify "MpiDefault=pmix" (and it can be
> seen in the log file that something is wrong with pmix: the address is
> already in use).
>
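As a sanity check, the plugin can also be forced explicitly on the srun command line (a sketch; the node/task counts and script name are placeholders, not from the thread):

```shell
# With MpiDefault=pmix the flag below is redundant, but passing it
# explicitly rules out a wrong or overridden default:
if command -v srun >/dev/null 2>&1; then
    srun -N2 -n16 --mpi=pmix python3 mpi_hello.py
else
    echo "srun not available on this machine"
fi
```

`srun --mpi=list` also shows which MPI plugins the installation actually supports.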
> One thing that struck me now is that I run these nodes as a pair of
> diskless nodes, which boot and mount the same filesystem, supplied by a
> server. They run different PIDs for different processes, which should
> not affect one another, right?
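One hedged way to probe that question: check whether the slurmd spool directory (where the stepd.slurm.pmix.* sockets are created) is node-local or part of the shared NFS image, since a shared spool path would let the two diskless nodes collide on the same socket (a sketch, assuming GNU df):

```shell
# Is the slurmd spool directory node-local, or on the shared root image?
DIR=/var/spool/slurmd
df -T "$DIR" 2>/dev/null || df -T /var/spool 2>/dev/null \
    || echo "df -T unavailable for $DIR"
```

If the filesystem type column shows nfs, pointing SlurmdSpoolDir in slurm.conf at node-local storage (e.g. a tmpfs) might give each node its own socket directory.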
>
>
> Best regards,
>
> Palle
> On 2019-07-12 19:34, Pär Lundö wrote:
>
> Hi,
>
> Thank you so much for your quick responses!
> It is much appreciated.
> I don't have access to the cluster until next week, but I'll be sure to
> follow up on all of your suggestions and get back to you next week.
>
> Have a nice weekend!
> Best regards
> Palle
>
> ------------------------------
> *From:* "slurm-users" <slurm-users-bounces at lists.schedmd.com>
> <slurm-users-bounces at lists.schedmd.com>
> *Sent:* 12 juli 2019 17:37
> *To:* "Slurm User Community List" <slurm-users at lists.schedmd.com>
> <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] Running pyMPI on several nodes
>
> Pär, by 'poking around' Chris means using tools such as netstat and lsof.
> Also, I would look at ps -eaf --forest to make sure there are no
> 'orphaned' jobs sitting on that compute node.
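A sketch of that check (the bracket trick keeps grep from matching its own command line):

```shell
# Look for leftover step daemons or MPI launchers from earlier jobs:
ps -eaf --forest | grep -E '[s]lurmstepd|[m]pirun' \
    || echo "no stray slurmstepd/mpirun processes"
```

Any hit here that does not belong to a currently running job is a candidate for holding the pmix socket.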
>
> Having said that though, I have a dim memory of a classic PBSPro error
> message which says something about a network connection,
> but really means that you cannot open a remote session on that compute
> server.
>
> As an aside, have you checked that your username exists on that compute
> server?      getent passwd par
> Also that your home directory is mounted - or something substituting for
> your home directory?
>
>
> On Fri, 12 Jul 2019 at 15:55, Chris Samuel < chris at csamuel.org> wrote:
>
>> On 12/7/19 7:39 am, Pär Lundö wrote:
>>
>> > Presumably, the first 8 tasks originates from the first node (in this
>> > case the lxclient11), and the other node (lxclient10) response as
>> > predicted.
>>
>> That looks right, it seems the other node has two processes fighting
>> over the same socket and that's breaking Slurm there.
>>
>> > Is it neccessary to have passwordless ssh communication alongside the
>> > munge authentication?
>>
>> No, srun doesn't need (or use) that at all.
>>
>> > In addition I checked the slurmctld-log from both the server and client
>> > and found something (noted in bold):
>>
>> This is from the slurmd log on the client from the look of it.
>>
>> > *[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using
>> > sched_affinity for tasks*
>> > *...stepd.slurm.pmix.83.0: Address already in use [98]*
>> > [2019-07-12T14:57:53.682][83.0] error: lxclient[0] /pmix.server.c:386
>> > [pmix_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
>> > [2019-07-12T14:57:53.683][83.0] error: (null) [0] /mpi_pmix:156
>> > [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init()
>> failed
>>
>> That indicates that something else has grabbed the socket it wants and
>> that's why the setup of the MPI ranks on the second node fails.
>>
>> You'll want to poke around there to see what's using it.
>>
>> Best of luck!
>> Chris
>> --
>>   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>>
>> --
> Regards, Pär
> ________________________________
> Pär Lundö
> Researcher
> Division of Command and Control Systems
>
> FOI
> Swedish Defence Research Agency (Totalförsvarets forskningsinstitut)
> 164 90 Stockholm
>
> Visiting address:
> Olau Magnus väg 33, Linköping
>
>
> Tel: +46 13 37 86 01
> Mobile: +46 734 447 815
> Switchboard: +46 13 37 80 00
> par.lundo at foi.se
> www.foi.se
>
>

