[slurm-users] [External] Re: serious bug about CUDA_VISIBLE_DEVICES in Slurm 17.11.7

John Hearns hearnsj at googlemail.com
Thu Aug 30 10:02:57 MDT 2018


Chaofeng,  I agree with what Chris says. You should be using cgroups.

I did a lot of work with cgroups and GPUs in PBSPro (yes I know... splitter!)
With cgroups you only get access to the devices which are allocated to that
cgroup, and you get CUDA_VISIBLE_DEVICES set for you.
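
For what it's worth, the relevant knob on the Slurm side is ConstrainDevices in
cgroup.conf. A minimal sketch (check the defaults for your Slurm version) would
be something like:

=====
# /etc/slurm/cgroup.conf  (goes together with task/cgroup in TaskPlugin)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
# this is the bit that hides the /dev/nvidia* devices a job did not request
ConstrainDevices=yes
=====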

Remember also to look at the permissions on /dev/nvidia(0,1,2...) and on
/dev/nvidiactl - they are usually OK.
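
For example, something along these lines on a GPU node (the device numbering
will vary):

=====
# jobs need read/write access to the control device and to the GPUs themselves
ls -l /dev/nvidiactl /dev/nvidia[0-9]*
=====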




On Thu, 30 Aug 2018 at 15:52, Renfro, Michael <Renfro at tntech.edu> wrote:

> Chris’ method will set CUDA_VISIBLE_DEVICES like you’re used to, and it
> will help keep you or your users from picking conflicting devices.
>
> My cgroup/GPU settings from slurm.conf:
>
> =====
>
> [renfro at login ~]$ egrep -i '(cgroup|gpu)' /etc/slurm/slurm.conf | grep -v
> '^#'
> ProctrackType=proctrack/cgroup
> TaskPlugin=task/affinity,task/cgroup
> NodeName=gpunode[001-004]  CoresPerSocket=14 RealMemory=126000 Sockets=2
> ThreadsPerCore=1 Gres=gpu:2
> PartitionName=gpu Default=NO MinNodes=1 DefaultTime=1-00:00:00
> MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1
> DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0
> PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL
> LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP
> Nodes=gpunode[001-004]
> PartitionName=gpu-debug Default=NO MinNodes=1 MaxTime=00:30:00
> AllowGroups=ALL PriorityJobFactor=2 PriorityTier=1 DisableRootJobs=NO
> RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO
> DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO
> OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
> PartitionName=gpu-interactive Default=NO MinNodes=1 MaxNodes=2
> MaxTime=02:00:00 AllowGroups=ALL PriorityJobFactor=3 PriorityTier=1
> DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0
> PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL
> LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP
> Nodes=gpunode[001-004]
> GresTypes=gpu,mic
>
> =====
>
> Example (where hpcshell is a function that runs “srun --pty $SHELL -I”), with
> no CUDA_VISIBLE_DEVICES set on the submit host, but correctly set once GPUs
> are reserved:
>
> =====
>
> [renfro at login ~]$ echo $CUDA_VISIBLE_DEVICES
>
> [renfro at login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:1
> [renfro at gpunode003 ~]$ echo $CUDA_VISIBLE_DEVICES
> 0
> [renfro at login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:2
> [renfro at gpunode004 ~]$ echo $CUDA_VISIBLE_DEVICES
> 0,1
>
> =====
>
> > On Aug 30, 2018, at 4:18 AM, Chaofeng Zhang <zhangcf1 at lenovo.com> wrote:
> >
> > CUDA_VISIBLE_DEVICES is used by many AI frameworks, like TensorFlow, to
> > determine which GPU to use. So this environment variable is critical to us.
> >
> > -----Original Message-----
> > From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of
> Chris Samuel
> > Sent: Thursday, August 30, 2018 4:42 PM
> > To: slurm-users at lists.schedmd.com
> > Subject: [External] Re: [slurm-users] serious bug about
> > CUDA_VISIBLE_DEVICES in Slurm 17.11.7
> >
> > On Thursday, 30 August 2018 6:38:08 PM AEST Chaofeng Zhang wrote:
> >
> >> The CUDA_VISIBLE_DEVICES can't be set to NoDevFiles in Slurm 17.11.7.
> >> This worked when we used Slurm 17.02.
> >
> > You probably should be using cgroups instead to constrain access to
> > GPUs. Then it doesn't matter what you set CUDA_VISIBLE_DEVICES to, as
> > processes will only be able to access what they requested.
> >
> > Hope that helps!
> > Chris
> > --
> > Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
> >
> >
> >
> >
> >
>
>
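
As a quick sanity check that the cgroup confinement is working, a sketch along
the lines of Michael's setup above (partition and gres names are his; nvidia-smi
-L simply lists the GPUs the process can see):

=====
# ask for one of the two GPUs on a node and list what the job can actually see;
# with ConstrainDevices=yes this should print exactly one GPU
srun --partition=gpu-interactive --gres=gpu:1 nvidia-smi -L
=====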