[slurm-users] inconsistent CUDA_VISIBLE_DEVICES with srun vs sbatch
Tim Carlson
tim.s.carlson at gmail.com
Wed May 19 18:26:30 UTC 2021
Hey folks,
Here is my setup:
slurm-20.11.4 on x86_64 running CentOS 7.x with CUDA 11.1
The relevant parts of the slurm.conf and a particular gres.conf file are:
SelectType=select/cons_res
SelectTypeParameters=CR_Core
PriorityType=priority/multifactor
GresTypes=gpu
NodeName=dlt[01-12] Gres=gpu:8 Feature=rtx Procs=40 State=UNKNOWN
PartitionName=dlt Nodes=dlt[01-12] Default=NO Shared=Exclusive MaxTime=4-00:00:00 State=UP DefaultTime=8:00:00
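For completeness, this is roughly how I would double-check what the controller thinks the node has configured (just a sketch of the check; the grep pattern is only illustrative):
[tim@rc-admin01 ~]$ scontrol show node dlt02 | grep -i -E 'gres|tres'
That should show the node's Gres= line and the gres/gpu count in CfgTRES.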
And the gres.conf file for those nodes:
[root@dlt02 ~]# more /etc/slurm/gres.conf
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
Name=gpu File=/dev/nvidia4
Name=gpu File=/dev/nvidia5
Name=gpu File=/dev/nvidia6
Name=gpu File=/dev/nvidia7
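As an aside, I believe the same eight devices could also be declared on a single line with a bracketed range (an equivalent form per the gres.conf man page, not what is actually on the nodes):
Name=gpu File=/dev/nvidia[0-7]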
Now for the weird part: srun works as expected and gives me a single GPU.
[tim@rc-admin01 ~]$ srun -p dlt -N 1 -w dlt02 --gres=gpu:1 -A ops --pty -u /bin/bash
[tim@dlt02 ~]$ env | grep CUDA
CUDA_VISIBLE_DEVICES=0
If I submit basically the same thing with sbatch:
[tim@rc-admin01 ~]$ cat sbatch.test
#!/bin/bash
#SBATCH -N 1
#SBATCH -A ops
#SBATCH -t 10
#SBATCH -p dlt
#SBATCH --gres=gpu:1
#SBATCH -w dlt02
env | grep CUDA
I get the following output.
[tim@rc-admin01 ~]$ cat slurm-28824.out
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
Any ideas what is going on here?
Thanks in advance! This one has me stumped.
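P.S. In case it helps, here is a minimal extra check I could add at the end of the batch script to see what Slurm actually allocated (a sketch; scontrol's -d flag prints the per-node GRES index detail, and $SLURM_JOB_ID is set inside the job):
scontrol -d show job $SLURM_JOB_ID | grep -iE 'gres|nodes'
That way the allocated GPU index can be compared against what ends up in CUDA_VISIBLE_DEVICES.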