[slurm-users] [External] Re: serious bug about CUDA_VISIBLE_DEVICES in Slurm 17.11.7

Chaofeng Zhang zhangcf1 at lenovo.com
Thu Aug 30 09:48:33 MDT 2018


$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=none -p GPU /usr/bin/env |grep CUDA
CUDA_VISIBLE_DEVICES=0,1

The result should be CUDA_VISIBLE_DEVICES=NoDevFiles, and it really is NoDevFiles in 17.02, so this must be a bug in 17.11.7.


From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Brian W. Johanson
Sent: Thursday, August 30, 2018 11:23 PM
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] [External] Re: serious bug about CUDA_VISIBLE_DEVICES in Slurm 17.11.7


and to answer "CUDA_VISIBLE_DEVICES can't be set to NoDevFiles in Slurm 17.11.7":

CUDA_VISIBLE_DEVICES is not set by Slurm when --gres=none is used, and if it is already set in the user's environment it will remain set to whatever value it had. If you really want to see NoDevFiles, set it in /etc/profile.d; it will get clobbered when GPU resources are actually allocated.
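
A minimal sketch of such a profile.d snippet (the filename and the guard are assumptions, not something quoted in this thread):

# /etc/profile.d/cuda_default.sh -- hypothetical filename
# Give login shells a harmless default; Slurm overwrites CUDA_VISIBLE_DEVICES
# whenever a job is actually allocated GPUs with --gres=gpu:N.
if [ -z "$CUDA_VISIBLE_DEVICES" ]; then
    export CUDA_VISIBLE_DEVICES=NoDevFiles
fi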



$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=none -p GPU /usr/bin/env |grep CUDA
CUDA_VISIBLE_DEVICES=0,1
$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=none -p GPU nvidia-smi
No devices were found


$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:1 -p GPU /usr/bin/env |grep CUDA
CUDA_VISIBLE_DEVICES=0
$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:1 -p GPU nvidia-smi |grep Tesla | wc
      1      11      80
$


$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:2 -p GPU /usr/bin/env |grep CUDA
CUDA_VISIBLE_DEVICES=0,1
$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:2 -p GPU nvidia-smi |grep Tesla | wc
      2      22     160
$



On 08/30/2018 10:48 AM, Renfro, Michael wrote:

Chris’ method will set CUDA_VISIBLE_DEVICES like you’re used to, and it will help keep you or your users from picking conflicting devices.

My cgroup/GPU settings from slurm.conf:

=====
[renfro at login ~]$ egrep -i '(cgroup|gpu)' /etc/slurm/slurm.conf | grep -v '^#'
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
NodeName=gpunode[001-004]  CoresPerSocket=14 RealMemory=126000 Sockets=2 ThreadsPerCore=1 Gres=gpu:2
PartitionName=gpu Default=NO MinNodes=1 DefaultTime=1-00:00:00 MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
PartitionName=gpu-debug Default=NO MinNodes=1 MaxTime=00:30:00 AllowGroups=ALL PriorityJobFactor=2 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
PartitionName=gpu-interactive Default=NO MinNodes=1 MaxNodes=2 MaxTime=02:00:00 AllowGroups=ALL PriorityJobFactor=3 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
GresTypes=gpu,mic
=====
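
For completeness, a minimal cgroup.conf sketch that would pair with the TaskPlugin=task/cgroup line above (these exact contents are an assumption, not quoted in this thread):

=====
# /etc/slurm/cgroup.conf -- sketch only, assumed rather than quoted
CgroupAutomount=yes
# ConstrainDevices is what actually blocks access to GPU device files that a
# job did not request, independent of CUDA_VISIBLE_DEVICES.
ConstrainDevices=yes
=====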



Example (where hpcshell is a shell function that runs “srun --pty $SHELL -I”; a rough sketch of such a wrapper follows the example), with CUDA_VISIBLE_DEVICES unset on the submit host but correctly set once GPUs are reserved:



=====
[renfro at login ~]$ echo $CUDA_VISIBLE_DEVICES

[renfro at login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:1
[renfro at gpunode003 ~]$ echo $CUDA_VISIBLE_DEVICES
0
[renfro at login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:2
[renfro at gpunode004 ~]$ echo $CUDA_VISIBLE_DEVICES
0,1

=====
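
For reference, a wrapper like hpcshell could be defined roughly as follows (a sketch only; the actual definition isn't shown in the thread):

# Hypothetical definition: pass partition/gres arguments through to srun and
# start an interactive shell on the allocated node, as described above.
hpcshell () {
    srun "$@" --pty $SHELL -I
}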



On Aug 30, 2018, at 4:18 AM, Chaofeng Zhang <zhangcf1 at lenovo.com> wrote:

CUDA_VISIBLE_DEVICES is used by many AI frameworks, such as TensorFlow, to determine which GPU to use, so this environment variable is critical to us.
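
For example (the script name here is just a placeholder), a training step launched under Slurm sees only the GPUs listed in the CUDA_VISIBLE_DEVICES value that Slurm exports for the allocation:

# Hypothetical job step: frameworks such as TensorFlow enumerate only the GPUs
# named in CUDA_VISIBLE_DEVICES, so the device granted by --gres=gpu:1 is used.
$ srun -N 1 -n 1 --gres=gpu:1 -p GPU python train.py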



-----Original Message-----
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Chris Samuel
Sent: Thursday, August 30, 2018 4:42 PM
To: slurm-users at lists.schedmd.com
Subject: [External] Re: [slurm-users] serious bug about CUDA_VISIBLE_DEVICES in Slurm 17.11.7

On Thursday, 30 August 2018 6:38:08 PM AEST Chaofeng Zhang wrote:



CUDA_VISIBLE_DEVICES can't be set to NoDevFiles in Slurm 17.11.7.

This worked when we used Slurm 17.02.



You should probably be using cgroups instead to constrain access to GPUs.

Then it doesn't matter what CUDA_VISIBLE_DEVICES is set to, as processes will only be able to access the GPUs they requested.
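
For instance, something like the following should show the effect (a sketch; nvidia-smi -L simply lists the GPUs visible to the process): even if the job step overrides CUDA_VISIBLE_DEVICES by hand, the cgroup still restricts it to the GPU it was allocated.

# Hedged example: with ConstrainDevices=yes in cgroup.conf, only the allocated
# GPU is listed, regardless of the CUDA_VISIBLE_DEVICES value set in the step.
$ srun -N 1 -n 1 --gres=gpu:1 -p GPU bash -c 'export CUDA_VISIBLE_DEVICES=0,1; nvidia-smi -L'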



Hope that helps!

Chris

--

Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC