[slurm-users] inconsistent CUDA_VISIBLE_DEVICES with srun vs sbatch

Tim Carlson tim.s.carlson at gmail.com
Wed May 19 20:41:03 UTC 2021


As a follow-up, we did figure out that if we set the partition to not be
exclusive, we get something that seems more reasonable.

That is to say that if I use a partition like this

PartitionName=dlt_shared Nodes=dlt[01-12] Default=NO Shared=YES
MaxTime=4-00:00:00 State=UP DefaultTime=8:00:00


with "shared=yes" then both sbatch and srun produce the expected results of
returning the correct value of CUDA_VISIBLE_DEVICES based on what I ask
for.
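
A quick way to check this (the script name and the two-GPU request below are
just an illustration, not something from the original report) would be along
these lines:

[tim@rc-admin01 ~]$ cat sbatch-shared.test
#!/bin/bash
#SBATCH -N 1
#SBATCH -A ops
#SBATCH -t 10
#SBATCH -p dlt_shared
#SBATCH --gres=gpu:2
env | grep CUDA

On the shared partition that should come back with two device indices (e.g.
CUDA_VISIBLE_DEVICES=0,1) rather than the full 0-7 list.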


It appears I should be switching to OverSubscribe= instead of Shared=, so I
will play with that when I can, but I still don't understand why, with
"Shared=EXCLUSIVE", srun gives one result and sbatch gives another.


Tim

On Wed, May 19, 2021 at 11:26 AM Tim Carlson <tim.s.carlson at gmail.com>
wrote:

> Hey folks,
>
> Here is my setup:
>
> slurm-20.11.4 on x86_64 running CentOS 7.x with CUDA 11.1
>
> The relevant parts of the slurm.conf  and a particular gres.conf file are:
>
> SelectType=select/cons_res
>
> SelectTypeParameters=CR_Core
>
> PriorityType=priority/multifactor
>
> GresTypes=gpu
>
>
> NodeName=dlt[01-12]  Gres=gpu:8 Feature=rtx Procs=40 State=UNKNOWN
>
> PartitionName=dlt Nodes=dlt[01-12] Default=NO Shared=Exclusive
> MaxTime=4-00:00:00 State=UP DefaultTime=8:00:00
>
>
> And the gres.conf file for those nodes
>
>
> [root@dlt02 ~]# more /etc/slurm/gres.conf
>
> Name=gpu File=/dev/nvidia0
>
> Name=gpu File=/dev/nvidia1
>
> Name=gpu File=/dev/nvidia2
>
> Name=gpu File=/dev/nvidia3
>
> Name=gpu File=/dev/nvidia4
>
> Name=gpu File=/dev/nvidia5
>
> Name=gpu File=/dev/nvidia6
>
> Name=gpu File=/dev/nvidia7
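>
> (Aside: I believe gres.conf also accepts a bracketed device range, so the
> eight lines above could be collapsed to the single line
>
> Name=gpu File=/dev/nvidia[0-7]
>
> though the long form is what is actually configured on these nodes.)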
>
>
> Now for the weird part. Srun works as expected and gives me a single GPU
>
>
> [tim@rc-admin01 ~]$ srun -p dlt -N 1 -w dlt02 --gres=gpu:1 -A ops --pty -u /bin/bash
>
> [tim@dlt02 ~]$ env | grep CUDA
>
> CUDA_VISIBLE_DEVICES=0
>
>
> If I submit basically the same thing with sbatch
>
>
> [tim@rc-admin01 ~]$ cat sbatch.test
>
> #!/bin/bash
>
> #SBATCH -N 1
>
> #SBATCH -A ops
>
> #SBATCH -t 10
>
> #SBATCH -p dlt
>
> #SBATCH --gres=gpu:1
>
> #SBATCH -w dlt02
>
> env | grep CUDA
>
>
> I get the following output.
>
>
> [tim@rc-admin01 ~]$ cat slurm-28824.out
>
> CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
>
>
>
> Any ideas of what is going on here?
>
>
> Thanks in advance! This one has me stumped.
>