[slurm-users] How to use a python virtualenv with srun?

Yann Bouteiller yann.bouteiller at polymtl.ca
Mon Nov 18 06:39:26 UTC 2019


Hi Gareth, thank you for your answer,

I have thought about that too, but I believe the --block option I pass
to ray start is supposed to make it sleep indefinitely precisely so
this does not happen. However, maybe it has no effect because I put
'&' at the end of the ray start command in the install_worker.sh
script, so that the call is non-blocking when I launch
install_worker.sh with srun in the parent script?
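
To make sure we are talking about the same pattern, here is roughly what
I have in mind (just a sketch, not my actual scripts; it assumes bash job
control in the sbatch script, with install_worker.sh as in my current
setup and the variables from the sbatch script quoted below):

```
# parent sbatch script: background the srun steps themselves, and keep
# 'ray start --block' in the foreground inside install_worker.sh
worker_pids=()
for ((  i=1; i<=$worker_num; i++ ))
do
  node2=${nodes_array[$i]}
  srun --export=ALL --nodes=1 --ntasks=1 -w $node2 install_worker.sh &
  worker_pids+=($!)         # PID of the backgrounded srun, one per worker
  sleep 5
done

python -u trainer.py $redis_password 15   # the actual work

kill "${worker_pids[@]}"    # tear down the worker steps once training is done
wait                        # reap the background sruns before the job exits
```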

Yann


"Williams, Gareth (IM&T, Black Mountain)" <Gareth.Williams at csiro.au> a écrit :

> Hi Yann,
>
> The remaining problem may be that the ray processes are not waited  
> on. I'm not sure, but I hope this gets you looking in the right place.  
> You may need to sleep indefinitely in the scripts that run the  
> worker ray processes; then, when the master has finished making them  
> work, cancel the workers and exit the main script.  If you just  
> exit the main script, computecanada will probably clean up for you  
> automatically - but it is polite to clean up after yourself.
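>
> Something along these lines, as a rough sketch (the step IDs passed to
> scancel are only illustrative):
>
> ```
> # worker script: start ray normally, then sleep so the step stays alive
> ray start --address=$ip_head --redis-password=$redis_password
> sleep infinity
>
> # main sbatch script, after the master has finished its work:
> python -u trainer.py $redis_password 15
> scancel ${SLURM_JOB_ID}.1 ${SLURM_JOB_ID}.2   # cancel the worker steps
> ```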
>
> Gareth
>
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf  
> Of Yann Bouteiller
> Sent: Monday, 18 November 2019 1:49 PM
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] How to use a python virtualenv with srun?
>
> Hello Brian, thank you for your answer.
>
> Actually, you are not allowed to install things in your home on  
> computecanada, which is why you need to install everything in a  
> virtualenv with pip install. Also, you have to create each  
> virtualenv in $SLURM_TMPDIR, which is the local drive of the node,  
> because everything else is slow, so I don't think I can rely on shared homes.
>
> Actually I succeeded in installing different virtualenvs on  
> different nodes, using a script for each worker that creates a local  
> virtualenv, installs ray in it, and connects to the ray server  
> running in the virtualenv of the head node (I mean the primary node,  
> yes). I just call these scripts with srun. However, for some reason,  
> the workers seem to connect to the server fine but are detected as  
> dead after a while: https://groups.google.com/forum/#!topic/ray-dev/INB_zVS5PWY
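>
> To be concrete, each worker's install script does roughly the following
> (a simplified sketch; $ip_head and $redis_password come from the parent
> script via srun --export=ALL):
>
> ```
> #!/bin/bash
> # install_worker.sh (sketch) -- runs once per worker node
> module load python/3.7.4
> python -m venv $SLURM_TMPDIR/venv      # node-local virtualenv on the local drive
> source $SLURM_TMPDIR/venv/bin/activate
> pip install ray
> ray start --block --address=$ip_head --redis-password=$redis_password &
> ```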
>
> Yann
>
>
>
> Brian Andrus <toomuchit at gmail.com> wrote:
>
>> I suspect when you say "head node" you mean the primary node among the
>> nodes you were allocated.
>>
>> Normally, when you use pip as a user, it installs in your home
>> directory. Are you certain all your nodes share the same homes?
>> If they are merely synched, that would not be the same. Not actually
>> sharing homes could be the cause.
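>>
>> A quick way to check would be something like this sketch, which prints
>> where python and ray resolve on every allocated node:
>>
>> ```
>> srun --ntasks-per-node=1 bash -c 'hostname; readlink -f $HOME; which python; python -c "import ray; print(ray.__file__)"'
>> ```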
>>
>> Brian Andrus
>>
>>
>> On 11/17/2019 11:24 AM, Yann Bouteiller wrote:
>>>
>>> Hello,
>>>
>>> I am trying to do this on computecanada, which is managed by slurm:
>>> https://ray.readthedocs.io/en/latest/deploying-on-slurm.html
>>>
>>> However, on computecanada, you cannot install things on nodes before
>>> the job has started, and you can only install things in a python
>>> virtualenv once the job has started.
>>>
>>> I can do:
>>>
>>> ```
>>> module load python/3.7.4
>>> source venv/bin/activate
>>> pip install ray
>>> ```
>>>
>>> in the bash script before calling everything else, but apparently
>>> this will only create and activate the virtualenv and install ray on
>>> the head node, not on the remote nodes, so calling
>>>
>>> ```
>>> srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head --redis-port=6379 --redis-password=$redis_password & # Starting the head
>>> ```
>>>
>>> will succeed, but later calling
>>>
>>> ```
>>> for ((  i=1; i<=$worker_num; i++ ))
>>> do
>>>   node2=${nodes_array[$i]}
>>>   srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password &
>>>   sleep 5
>>> done
>>> ```
>>>
>>> will produce the following error:
>>>
>>> ```
>>> slurmstepd: error: execve(): ray: No such file or directory
>>> srun: error: cdr768: task 0: Exited with exit code 2
>>> srun: Terminating job step 31218604.3
>>> [2]+  Exit 2                 srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password
>>> ```
>>>
>>> How can I tackle this issue, please? I am a beginner with slurm, so I
>>> am not sure what the problem is here. Here is my whole sbatch
>>> script:
>>>
>>> ```
>>> #!/bin/bash
>>>
>>> #SBATCH --job-name=test
>>> #SBATCH --cpus-per-task=5
>>> #SBATCH --mem-per-cpu=1000M
>>> #SBATCH --nodes=3
>>> #SBATCH --tasks-per-node 1
>>>
>>> worker_num=2 # Must be one less than the total number of nodes
>>> nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
>>> nodes_array=( $nodes )
>>>
>>> module load python/3.7.4
>>> source venv/bin/activate
>>> pip install ray
>>>
>>> node1=${nodes_array[0]}
>>> ip_prefix=$(srun --nodes=1 --ntasks=1 -w $node1 hostname --ip-address) # Making address
>>> suffix=':6379'
>>> ip_head=$ip_prefix$suffix
>>> redis_password=$(uuidgen)
>>> export ip_head # Exporting for later access by trainer.py
>>>
>>> srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head --redis-port=6379 --redis-password=$redis_password & # Starting the head
>>> sleep 5
>>>
>>> for ((  i=1; i<=$worker_num; i++ ))
>>> do
>>>   node2=${nodes_array[$i]}
>>>   srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password & # Starting the workers
>>>   sleep 5
>>> done
>>>
>>> python -u trainer.py $redis_password 15 # Pass the total number of allocated CPUs
>>>
>>> ```
>>>
>>> ---
>>> Regards,
>>> Yann
>>>
>>>





