[slurm-users] inconsistent CUDA_VISIBLE_DEVICES with srun vs sbatch

Tim Carlson tim.s.carlson at gmail.com
Wed May 19 18:26:30 UTC 2021

Hey folks,

Here is my setup:

slurm-20.11.4 on x86_64 running CentOS 7.x with CUDA 11.1

The relevant parts of the slurm.conf and a particular gres.conf file are:
NodeName=dlt[01-12]  Gres=gpu:8 Feature=rtx Procs=40 State=UNKNOWN

PartitionName=dlt Nodes=dlt[01-12] Default=NO Shared=Exclusive MaxTime=4-00:00:00 State=UP DefaultTime=8:00:00

And the gres.conf file for those nodes:

[root@dlt02 ~]# more /etc/slurm/gres.conf

Name=gpu File=/dev/nvidia0

Name=gpu File=/dev/nvidia1

Name=gpu File=/dev/nvidia2

Name=gpu File=/dev/nvidia3

Name=gpu File=/dev/nvidia4

Name=gpu File=/dev/nvidia5

Name=gpu File=/dev/nvidia6

Name=gpu File=/dev/nvidia7
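
As an aside, gres.conf also accepts bracketed device ranges, so (if I'm reading the man page right) the eight lines above could be collapsed into a single entry with the same meaning:

```
Name=gpu File=/dev/nvidia[0-7]
```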

Now for the weird part: srun works as expected and gives me a single GPU.

[tim@rc-admin01 ~]$ srun -p dlt -N 1 -w dlt02 --gres=gpu:1 -A ops --pty -u

[tim@dlt02 ~]$ env | grep CUDA


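A quick way to sanity-check what a job step actually sees is to parse the variable programmatically; this little helper is just an illustration (not part of my setup), and it assumes Slurm hands back numeric device indices rather than GPU UUID strings:

```python
import os

def visible_gpus():
    """Parse CUDA_VISIBLE_DEVICES into a list of GPU indices.

    An unset or empty variable yields [] (no GPUs bound to the step).
    Assumes numeric indices; the variable can also hold UUID strings.
    """
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(tok) for tok in raw.split(",") if tok.strip()]

# With --gres=gpu:1 a single index is expected, e.g. [0].
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
print(visible_gpus())  # prints [0]
```

Running the same check in both the srun session and the batch script makes the discrepancy easy to compare.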
If I submit basically the same thing with sbatch:

[tim@rc-admin01 ~]$ cat sbatch.test



#SBATCH -A ops

#SBATCH -t 10

#SBATCH -p dlt

#SBATCH --gres=gpu:1

#SBATCH -w dlt02

env | grep CUDA

I get the following output.

[tim@rc-admin01 ~]$ cat slurm-28824.out


Any ideas about what is going on here?

Thanks in advance! This one has me stumped.
