[slurm-users] How to use a python virtualenv with srun?

Brian Andrus toomuchit at gmail.com
Sun Nov 17 22:56:28 UTC 2019


I suspect that when you say "head node" you mean the primary node among the 
nodes you were allocated.

Normally, when you run pip as a user, it installs packages into your home 
directory. Are you certain all of your nodes share the same home directories?
If they are merely synced, that is not the same thing. Not actually 
sharing homes could be the cause.
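
A quick way to check (a minimal sketch, not from the original thread, and it 
assumes the virtualenv sits under your home directory at $HOME/venv rather 
than the relative path used in the script below) is to run one task on each 
allocated node and see whether the venv's ray executable is visible everywhere:

```
# Hypothetical diagnostic: launch one task per allocated node and report
# whether the venv's ray binary can be found there.
# Assumes the venv was created at $HOME/venv (adjust the path to match).
srun --ntasks=$SLURM_JOB_NUM_NODES --ntasks-per-node=1 \
     bash -c 'echo "$(hostname): $(ls -d $HOME/venv/bin/ray 2>&1)"'
```

If some nodes print "No such file or directory" while the first node does not, 
the install is landing on storage the other nodes cannot see.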

Brian Andrus


On 11/17/2019 11:24 AM, Yann Bouteiller wrote:
>
> Hello,
>
> I am trying to do this on computecanada, which is managed by slurm: 
> https://ray.readthedocs.io/en/latest/deploying-on-slurm.html
>
> However, on computecanada you cannot install packages on the nodes 
> before the job has started; you can only install them into a python 
> virtualenv once the job is running.
>
> I can do:
>
> ```
> module load python/3.7.4
> source venv/bin/activate
> pip install ray
> ```
>
> in the bash script before calling everything else, but apparently this 
> only creates/activates the virtualenv and installs ray on the head 
> node, not on the remote nodes, so calling
>
> ```
> srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head --redis-port=6379 --redis-password=$redis_password & # Starting the head
> ```
>
> will succeed, but later calling
>
> ```
> for ((  i=1; i<=$worker_num; i++ ))
> do
>   node2=${nodes_array[$i]}
>   srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password & # Starting the workers
>   sleep 5
> done
>
> ```
>
> will produce the following error:
>
> ```
> slurmstepd: error: execve(): ray: No such file or directory
> srun: error: cdr768: task 0: Exited with exit code 2
> srun: Terminating job step 31218604.3
> [2]+  Exit 2                  srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password
> ```
>
> How can I tackle this issue, please? I am a beginner with slurm, so I 
> am not sure what the problem is here. Here is my whole sbatch script:
>
> ```
> #!/bin/bash
>
> #SBATCH --job-name=test
> #SBATCH --cpus-per-task=5
> #SBATCH --mem-per-cpu=1000M
> #SBATCH --nodes=3
> #SBATCH --tasks-per-node 1
>
> worker_num=2 # Must be one less than the total number of nodes
> nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
> nodes_array=( $nodes )
>
> module load python/3.7.4
> source venv/bin/activate
> pip install ray
>
> node1=${nodes_array[0]}
> ip_prefix=$(srun --nodes=1 --ntasks=1 -w $node1 hostname --ip-address) # Making address
> suffix=':6379'
> ip_head=$ip_prefix$suffix
> redis_password=$(uuidgen)
> export ip_head # Exporting for later access by trainer.py
>
> srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head --redis-port=6379 --redis-password=$redis_password & # Starting the head
> sleep 5
>
> for ((  i=1; i<=$worker_num; i++ ))
> do
>   node2=${nodes_array[$i]}
>   srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password & # Starting the workers
>   sleep 5
> done
>
> python -u trainer.py $redis_password 15 # Pass the total number of allocated CPUs
>
> ```
>
> ---
> Regards,
> Yann
>
>


