[slurm-users] How to use a python virtualenv with srun?

Yann Bouteiller yann.bouteiller at polymtl.ca
Sun Nov 17 19:24:29 UTC 2019


Hello,

I am trying to do this on Compute Canada, which is managed by Slurm:
https://ray.readthedocs.io/en/latest/deploying-on-slurm.html

However, on Compute Canada you cannot install anything on the nodes
before the job has started; packages can only be installed into a
Python virtualenv once the job is running.

I can do:

```
module load python/3.7.4
source venv/bin/activate
pip install ray
```

in the bash script before calling everything else, but apparently this
only creates and activates the virtualenv and installs ray on the head
node, not on the remote nodes, so calling

```
srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head --redis-port=6379 --redis-password=$redis_password & # Starting the head
```

will succeed, but later calling

```
for ((  i=1; i<=$worker_num; i++ ))
do
   node2=${nodes_array[$i]}
   srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password &
   sleep 5
done
```

will produce the following error:

```
slurmstepd: error: execve(): ray: No such file or directory
srun: error: cdr768: task 0: Exited with exit code 2
srun: Terminating job step 31218604.3
[2]+  Exit 2                  srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password
```
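
My understanding of the error is that srun tries to execve `ray` directly on the
worker node, and the executable from the virtualenv is simply not found there. A
quick diagnostic I could add inside the worker loop, just to confirm this (a
sketch only; it assumes the venv sits under the job's working directory, as in
my script below):

```
# Hypothetical check: is the venv's ray executable visible from a worker node?
srun --nodes=1 --ntasks=1 -w $node2 bash -c 'ls venv/bin/ray; which ray || echo "ray not on PATH"'
```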

How can I tackle this issue, please? I am a beginner with Slurm, so I
am not sure what the problem is here. Here is my whole sbatch script:

```
#!/bin/bash

#SBATCH --job-name=test
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=1000M
#SBATCH --nodes=3
#SBATCH --tasks-per-node 1

worker_num=2 # Must be one less than the total number of nodes
nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )

module load python/3.7.4
source venv/bin/activate
pip install ray

node1=${nodes_array[0]}
ip_prefix=$(srun --nodes=1 --ntasks=1 -w $node1 hostname --ip-address) # Making address
suffix=':6379'
ip_head=$ip_prefix$suffix
redis_password=$(uuidgen)
export ip_head # Exporting for later access by trainer.py

srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head --redis-port=6379 --redis-password=$redis_password & # Starting the head
sleep 5

for ((  i=1; i<=$worker_num; i++ ))
do
   node2=${nodes_array[$i]}
   srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password & # Starting the workers
   sleep 5
done

python -u trainer.py $redis_password 15 # Pass the total number of allocated CPUs

```
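
One idea I had is to wrap each `ray start` in a shell that activates the
virtualenv on the target node, instead of relying on srun finding `ray` on the
PATH it inherits. This is only a sketch of what I have in mind, and it assumes
`venv/` lives on a filesystem that all allocated nodes can see (I have not
tested it and I am not sure it is the right approach):

```
# Sketch (untested): activate the venv on the remote node before launching ray.
# Assumes venv/ is on a shared filesystem; the remote shell may also need
# "module load python/3.7.4" before the activation.
srun --nodes=1 --ntasks=1 -w $node2 \
    bash -c "source venv/bin/activate && ray start --block --address=$ip_head --redis-password=$redis_password" &
```

Is something like this the recommended way to make a virtualenv usable by srun
steps on the other nodes, or is there a cleaner solution?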

---
Regards,
Yann



