[slurm-users] How to use a python virtualenv with srun?
Yann Bouteiller
yann.bouteiller at polymtl.ca
Sun Nov 17 19:24:29 UTC 2019
Hello,
I am trying to do this on computecanada, which is managed by slurm:
https://ray.readthedocs.io/en/latest/deploying-on-slurm.html
However, on computecanada you cannot install packages on the nodes ahead of
time; you can only install them into a python virtualenv once the job has
started.
I can do:
```
module load python/3.7.4
source venv/bin/activate
pip install ray
```
in the bash script before calling everything else, but apparently this
only activates the virtualenv and installs ray on the head
node, not on the remote nodes, so calling
```
srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head \
  --redis-port=6379 --redis-password=$redis_password &  # Starting the head
```
will succeed, but later calling
```
for (( i=1; i<=$worker_num; i++ ))
do
node2=${nodes_array[$i]}
srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block \
  --address=$ip_head --redis-password=$redis_password &
sleep 5
done
```
will produce the following error:
```
slurmstepd: error: execve(): ray: No such file or directory
srun: error: cdr768: task 0: Exited with exit code 2
srun: Terminating job step 31218604.3
[2]+  Exit 2    srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password
```
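For reference, one workaround I have been considering (just an untested sketch, assuming the virtualenv sits in the job's working directory) is to activate the virtualenv inside each srun step, so that the ray executable ends up on the PATH of the shell running on the remote node rather than only in the batch script's environment:
```
# Untested sketch: activate the virtualenv on the remote node itself,
# then start the ray worker from inside that shell.
srun --export=ALL --nodes=1 --ntasks=1 -w $node2 \
  bash -c "source venv/bin/activate && ray start --block --address=$ip_head --redis-password=$redis_password" &
```
But I do not know whether this is the right approach.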
How can I tackle this issue, please? I am a beginner with slurm, so I am
not sure what the problem is here. Here is my whole sbatch script:
```
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=1000M
#SBATCH --nodes=3
#SBATCH --tasks-per-node 1
worker_num=2 # Must be one less than the total number of nodes
nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )
module load python/3.7.4
source venv/bin/activate
pip install ray
node1=${nodes_array[0]}
ip_prefix=$(srun --nodes=1 --ntasks=1 -w $node1 hostname --ip-address)
# Making address
suffix=':6379'
ip_head=$ip_prefix$suffix
redis_password=$(uuidgen)
export ip_head # Exporting for later access by trainer.py
srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head \
  --redis-port=6379 --redis-password=$redis_password &  # Starting the head
sleep 5
for (( i=1; i<=$worker_num; i++ ))
do
node2=${nodes_array[$i]}
srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block \
  --address=$ip_head --redis-password=$redis_password &  # Starting the workers
sleep 5
done
python -u trainer.py $redis_password 15  # Pass the total number of allocated CPUs
```
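Alternatively, since the home directory is shared across the nodes, maybe calling the ray executable by its absolute path inside the virtualenv would avoid relying on PATH altogether. Again only an untested sketch; `$SLURM_SUBMIT_DIR` assumes the venv lives in the directory the job was submitted from:
```
# Untested sketch: invoke the ray binary installed in the virtualenv directly,
# so the worker step does not depend on PATH being set up on the remote node.
srun --export=ALL --nodes=1 --ntasks=1 -w $node2 \
  "$SLURM_SUBMIT_DIR/venv/bin/ray" start --block \
  --address=$ip_head --redis-password=$redis_password &
```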
---
Regards,
Yann