[slurm-users] How to use a python virtualenv with srun?

Williams, Gareth (IM&T, Black Mountain) Gareth.Williams at csiro.au
Mon Nov 18 07:43:26 UTC 2019


Hi Yann,

This is a somewhat unusual situation (though many will have come across one variation or another). Looking harder, I think you are doing the right thing, except that it would be better to wait/test for the workers to be up before starting the client. I presume the 'sleep 5' commands are there in the hope that the workers will be ready within 5s. In any case, you say the master can be contacted initially, so it is ready at that point; the failure seems to come soon after.
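
Something like this could replace the fixed sleep (an untested sketch; it assumes the $ip_prefix variable and port 6379 from your script, and uses bash's built-in /dev/tcp to probe the port):

```
# Poll the head's redis port until it accepts connections (up to 30s),
# rather than hoping a fixed 'sleep 5' is long enough.
for i in $(seq 1 30); do
  if (echo > "/dev/tcp/${ip_prefix}/6379") 2>/dev/null; then
    echo "ray head is reachable"
    break
  fi
  sleep 1
done
```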

I don't suppose there is any useful output? Perhaps you can add options for ray to be more verbose.

Aside: you could probably use srun options instead of the loop over nodes, though that doesn't really matter. It may matter more that you don't need the sleep in that loop unless you want to stagger the start of the workers.
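
For instance, something along these lines might start all the workers in one step rather than looping (untested; it leans on srun's -x/--exclude to leave out the head node):

```
# One srun step placing a 'ray start' task on every allocated node
# except the head, instead of a per-node loop with sleeps in between.
srun --export=ALL --nodes=$worker_num --ntasks=$worker_num -x $node1 \
  ray start --block --address=$ip_head --redis-password=$redis_password &
```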

Gareth 

-----Original Message-----
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Yann Bouteiller
Sent: Monday, 18 November 2019 5:39 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] How to use a python virtualenv with srun?

Hi Gareth, thank you for your answer,

I have thought about that too, but I think the --block option that I use in ray start is supposed to sleep indefinitely precisely so that this does not happen.
However, maybe it is not taken into account because I use '&' at the end of the ray start command in the install_worker.sh script, so that the call is non-blocking when I launch install_worker.sh with srun in the parent script?
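
If the '&' is indeed the problem, I suppose I could remove it inside install_worker.sh, let 'ray start --block' keep the step alive, and background the srun call itself in the parent script instead, e.g. (untested):

```
# Background the whole srun step rather than the ray process inside it,
# and keep the batch script alive until the worker steps finish.
srun --nodes=1 --ntasks=1 -w $node2 bash install_worker.sh &
worker_pids+=($!)
# ...once the client has finished:
wait "${worker_pids[@]}"
```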

Yann


"Williams, Gareth (IM&T, Black Mountain)" <Gareth.Williams at csiro.au> a écrit :

> Hi Yann,
>
> The remaining problem may be that the ray processes are not waited on.
> I'm not sure, but hope this gets you looking in the right place.
> You may need to sleep indefinitely in the scripts that run the worker
> ray processes; then, when the master has finished making them work,
> cancel the workers and exit the main script.  If you just exit the
> main script, computecanada will probably clean up for you automatically
> - but it is polite to clean up after yourself.
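>
> Roughly what I have in mind (untested sketch, reusing the variables
> from your script):
>
> ```
> # remember the PIDs of the backgrounded worker steps
> worker_pids=()
> srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block \
>   --address=$ip_head --redis-password=$redis_password &
> worker_pids+=($!)
>
> # when the master is done, cancel the workers before exiting
> python -u trainer.py $redis_password 15
> kill "${worker_pids[@]}" 2>/dev/null
> ```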
>
> Gareth
>
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of 
> Yann Bouteiller
> Sent: Monday, 18 November 2019 1:49 PM
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] How to use a python virtualenv with srun?
>
> Hello Brian, thank you for your answer.
>
> Actually, you are not allowed to install things in your home on
> computecanada; this is why you need to install everything in a
> virtualenv with pip install. Also, you have to install each virtualenv
> in $SLURM_TMPDIR, which is the local drive of the node, because
> everything else is slow, so I think I cannot share homes.
>
> Actually I succeeded at installing different virtualenvs on different 
> nodes using a script for each worker that creates a local virtualenv, 
> installs ray on it, and connects to the ray server running in the 
> virtualenv of the head node (I mean the primary node, yes). I just 
> call these scripts with srun. However, for some reason, the workers 
> seem to connect fine to the server but are detected as dead after a
> while: https://groups.google.com/forum/#!topic/ray-dev/INB_zVS5PWY
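>
> For reference, each worker script does roughly the following
> (a simplified sketch of my install_worker.sh):
>
> ```
> # install_worker.sh: build a node-local virtualenv, then join the cluster
> module load python/3.7.4
> virtualenv $SLURM_TMPDIR/venv          # node-local drive, per the above
> source $SLURM_TMPDIR/venv/bin/activate
> pip install ray
> ray start --block --address=$ip_head --redis-password=$redis_password &
> ```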
>
> Yann
>
>
>
> Brian Andrus <toomuchit at gmail.com> a écrit :
>
>> I suspect when you say "head node" you mean the primary node of the 
>> nodes you were allocated.
>>
>> Normally, when you use pip as a user, it installs in your home 
>> directory. Are you certain all your nodes share the same homes?
>> If they are merely synched, that would not be the same. Not actually 
>> sharing homes could be the cause.
>>
>> Brian Andrus
>>
>>
>> On 11/17/2019 11:24 AM, Yann Bouteiller wrote:
>>>
>>> Hello,
>>>
>>> I am trying to do this on computecanada, which is managed by slurm:
>>> https://ray.readthedocs.io/en/latest/deploying-on-slurm.html
>>>
>>> However, on computecanada, you cannot install things on nodes before 
>>> the job has started, and you can only install things in a python 
>>> virtualenv once the job has started.
>>>
>>> I can do:
>>>
>>> ```
>>> module load python/3.7.4
>>> source venv/bin/activate
>>> pip install ray
>>> ```
>>>
>>> in the bash script before calling everything else, but apparently 
>>> this will only create and activate the virtualenv and install ray on 
>>> the head node, not on the remote nodes, so calling
>>>
>>> ```
>>> srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head \
>>>   --redis-port=6379 --redis-password=$redis_password & # Starting the head
>>> ```
>>>
>>> will succeed, but later calling
>>>
>>> ```
>>> for ((  i=1; i<=$worker_num; i++ ))
>>> do
>>>   node2=${nodes_array[$i]}
>>>   srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block \
>>>     --address=$ip_head --redis-password=$redis_password &
>>>   sleep 5
>>> done
>>>
>>> ```
>>>
>>> will produce the following error:
>>>
>>> ```
>>> slurmstepd: error: execve(): ray: No such file or directory
>>> srun: error: cdr768: task 0: Exited with exit code 2
>>> srun: Terminating job step 31218604.3
>>> [2]+  Exit 2    srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password
>>> ```
>>>
>>> How can I tackle this issue, please? I am a beginner with slurm so I 
>>> am not sure what the problem is here. Here is my whole sbatch
>>> script:
>>>
>>> ```
>>> #!/bin/bash
>>>
>>> #SBATCH --job-name=test
>>> #SBATCH --cpus-per-task=5
>>> #SBATCH --mem-per-cpu=1000M
>>> #SBATCH --nodes=3
>>> #SBATCH --tasks-per-node 1
>>>
>>> worker_num=2 # Must be one less than the total number of nodes
>>> nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
>>> nodes_array=( $nodes )
>>>
>>> module load python/3.7.4
>>> source venv/bin/activate
>>> pip install ray
>>>
>>> node1=${nodes_array[0]}
>>> ip_prefix=$(srun --nodes=1 --ntasks=1 -w $node1 hostname --ip-address) # Making address
>>> suffix=':6379'
>>> ip_head=$ip_prefix$suffix
>>> redis_password=$(uuidgen)
>>> export ip_head # Exporting for later access by trainer.py
>>>
>>> srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head \
>>>   --redis-port=6379 --redis-password=$redis_password & # Starting the head
>>> sleep 5
>>>
>>> for ((  i=1; i<=$worker_num; i++ ))
>>> do
>>>   node2=${nodes_array[$i]}
>>>   srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block \
>>>     --address=$ip_head --redis-password=$redis_password & # Starting the workers
>>>   sleep 5
>>> done
>>>
>>> python -u trainer.py $redis_password 15 # Pass the total number of allocated CPUs
>>>
>>> ```
>>>
>>> ---
>>> Regards,
>>> Yann
>>>
>>>