[slurm-users] inconsistent CUDA_VISIBLE_DEVICES with srun vs sbatch

Tim Carlson tim.s.carlson at gmail.com
Wed May 19 18:26:30 UTC 2021


Hey folks,

Here is my setup:

slurm-20.11.4 on x86_64 running CentOS 7.x with CUDA 11.1

The relevant parts of slurm.conf and the gres.conf file for these nodes are:

SelectType=select/cons_res
SelectTypeParameters=CR_Core
PriorityType=priority/multifactor
GresTypes=gpu

NodeName=dlt[01-12] Gres=gpu:8 Feature=rtx Procs=40 State=UNKNOWN

PartitionName=dlt Nodes=dlt[01-12] Default=NO Shared=Exclusive MaxTime=4-00:00:00 State=UP DefaultTime=8:00:00
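
(For completeness, these are the commands I would use to confirm that the scheduler actually sees the node and partition the way the config reads; this is just a sketch of the checks, not a transcript:)

scontrol show node dlt02 | grep -i gres    # should report Gres=gpu:8
scontrol show partition dlt                # Shared=Exclusive should show up as OverSubscribe=EXCLUSIVE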


And the gres.conf file for those nodes:


[root@dlt02 ~]# more /etc/slurm/gres.conf
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
Name=gpu File=/dev/nvidia4
Name=gpu File=/dev/nvidia5
Name=gpu File=/dev/nvidia6
Name=gpu File=/dev/nvidia7
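
(Likewise, to confirm that slurmd on the node parsed this gres.conf and that the device files exist, I would run the following on dlt02; sketch only, no output pasted:)

ls -l /dev/nvidia[0-7]    # the device files referenced above
slurmd -G                 # print the GRES configuration slurmd detected, then exit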


Now for the weird part: srun works as expected and gives me a single GPU.


[tim@rc-admin01 ~]$ srun -p dlt -N 1 -w dlt02 --gres=gpu:1 -A ops --pty -u /bin/bash
[tim@dlt02 ~]$ env | grep CUDA
CUDA_VISIBLE_DEVICES=0
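
(From inside that interactive shell, assuming cgroup device constraints are in effect, I would also expect nvidia-smi to see only the one device:)

nvidia-smi -L    # list the GPUs actually reachable from this step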


If I submit basically the same thing with sbatch:


[tim@rc-admin01 ~]$ cat sbatch.test
#!/bin/bash
#SBATCH -N 1
#SBATCH -A ops
#SBATCH -t 10
#SBATCH -p dlt
#SBATCH --gres=gpu:1
#SBATCH -w dlt02

env | grep CUDA


I get the following output.


[tim@rc-admin01 ~]$ cat slurm-28824.out
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
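
(In case it helps whoever looks at this, the next thing I plan to try is the variant below, to see whether a proper job step launched with srun inside the batch allocation gets a single GPU even though the batch step itself reports all eight. This is only a sketch of what I intend to run, not output I already have.)

#!/bin/bash
#SBATCH -N 1
#SBATCH -A ops
#SBATCH -t 10
#SBATCH -p dlt
#SBATCH --gres=gpu:1
#SBATCH -w dlt02

echo "batch step:"; env | grep CUDA        # environment of the batch step itself
echo "job step:";   srun env | grep CUDA   # environment of a step launched via srun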



Any ideas of what is going on here?


Thanks in advance! This one has me stumped.

