[slurm-users] [External] Re: serious bug about CUDA_VISIBLE_DEVICES in the slurm 17.11.7

John Hearns hearnsj at googlemail.com
Thu Aug 30 10:03:57 MDT 2018


I also remember there being write-only permissions involved when working
with cgroups and devices .. which bent my head slightly..
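
If it helps anyone else: in the cgroup v1 devices controller the
allow/deny interface files really are write-only, so you can echo rules
into them but not read them back; the current state is only visible via
devices.list. Roughly, from inside a job's device cgroup (path and
output illustrative):

$ ls -l devices.allow devices.deny devices.list
--w------- 1 root root 0 ... devices.allow
--w------- 1 root root 0 ... devices.deny
-r--r--r-- 1 root root 0 ... devices.list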

On Thu, 30 Aug 2018 at 17:02, John Hearns <hearnsj at googlemail.com> wrote:

> Chaofeng,  I agree with what Chris says. You should be using cgroups.
>
> I did a lot of work with cgroups and GPUs in PBSPro (yes I know...
> splitter!)
> With cgroups you only get access to the devices which are allocated to
> that cgroup, and you get CUDA_VISIBLE_DEVICES set for you.
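>
> Concretely, on the Slurm side that means TaskPlugin=task/cgroup in
> slurm.conf, device constraint turned on in cgroup.conf, and the GPU
> device files listed in gres.conf. A minimal sketch (device paths are
> illustrative):
>
> # cgroup.conf
> CgroupAutomount=yes
> ConstrainDevices=yes
>
> # gres.conf on each GPU node
> Name=gpu File=/dev/nvidia0
> Name=gpu File=/dev/nvidia1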
>
> Remember also to look at the permissions on /dev/nvidia(0,1,2...) - which
> are usually OK - and on /dev/nvidiactl.
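>
> For example, on a typical node (output illustrative; world read/write is
> what you normally want to see):
>
> $ ls -l /dev/nvidia0 /dev/nvidia1 /dev/nvidiactl
> crw-rw-rw- 1 root root 195,   0 ... /dev/nvidia0
> crw-rw-rw- 1 root root 195,   1 ... /dev/nvidia1
> crw-rw-rw- 1 root root 195, 255 ... /dev/nvidiactl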
>
>
>
>
> On Thu, 30 Aug 2018 at 15:52, Renfro, Michael <Renfro at tntech.edu> wrote:
>
>> Chris’ method will set CUDA_VISIBLE_DEVICES like you’re used to, and it
>> will help keep you or your users from picking conflicting devices.
>>
>> My cgroup/GPU settings from slurm.conf:
>>
>> =====
>>
>> [renfro at login ~]$ egrep -i '(cgroup|gpu)' /etc/slurm/slurm.conf | grep -v '^#'
>> ProctrackType=proctrack/cgroup
>> TaskPlugin=task/affinity,task/cgroup
>> NodeName=gpunode[001-004] CoresPerSocket=14 RealMemory=126000 Sockets=2 ThreadsPerCore=1 Gres=gpu:2
>> PartitionName=gpu Default=NO MinNodes=1 DefaultTime=1-00:00:00 MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
>> PartitionName=gpu-debug Default=NO MinNodes=1 MaxTime=00:30:00 AllowGroups=ALL PriorityJobFactor=2 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
>> PartitionName=gpu-interactive Default=NO MinNodes=1 MaxNodes=2 MaxTime=02:00:00 AllowGroups=ALL PriorityJobFactor=3 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
>> GresTypes=gpu,mic
>>
>> =====
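>>
>> The matching gres.conf on the gpunode hosts would be along these lines
>> (device paths are an assumption; adjust to the actual hardware):
>>
>> =====
>>
>> NodeName=gpunode[001-004] Name=gpu File=/dev/nvidia0
>> NodeName=gpunode[001-004] Name=gpu File=/dev/nvidia1
>>
>> =====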
>>
>> Example (where hpcshell is a shell function that runs “srun --pty $SHELL -I”):
>> CUDA_VISIBLE_DEVICES is unset on the submit host, but is set correctly once
>> GPUs are reserved:
>>
>> =====
>>
>> [renfro at login ~]$ echo $CUDA_VISIBLE_DEVICES
>>
>> [renfro at login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:1
>> [renfro at gpunode003 ~]$ echo $CUDA_VISIBLE_DEVICES
>> 0
>> [renfro at login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:2
>> [renfro at gpunode004 ~]$ echo $CUDA_VISIBLE_DEVICES
>> 0,1
>>
>> =====
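>>
>> For reference, hpcshell is just a thin wrapper around that srun command;
>> a sketch of the sort of function meant (the real definition may differ):
>>
>> hpcshell () {
>>     # forward partition/gres flags, then start an interactive shell on the node
>>     srun "$@" --pty -I $SHELL
>> }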
>>
>> > On Aug 30, 2018, at 4:18 AM, Chaofeng Zhang <zhangcf1 at lenovo.com> wrote:
>> >
>> > CUDA_VISIBLE_DEVICES is used by many AI frameworks, such as TensorFlow,
>> > to determine which GPU to use, so this environment variable is critical
>> > to us.
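>> >
>> > For illustration (train.py is just a placeholder script): on a box with
>> > two GPUs,
>> >
>> >     $ CUDA_VISIBLE_DEVICES=1 python train.py
>> >
>> > lets the framework see only the second physical GPU, which it then
>> > enumerates as device 0.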
>> >
>> > -----Original Message-----
>> > From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Chris Samuel
>> > Sent: Thursday, August 30, 2018 4:42 PM
>> > To: slurm-users at lists.schedmd.com
>> > Subject: [External] Re: [slurm-users] serious bug about CUDA_VISIBLE_DEVICES in the slurm 17.11.7
>> >
>> > On Thursday, 30 August 2018 6:38:08 PM AEST Chaofeng Zhang wrote:
>> >
>> >> CUDA_VISIBLE_DEVICES can't be set to NoDevFiles in Slurm 17.11.7.
>> >> This worked when we used Slurm 17.02.
>> >
>> > You probably should be using cgroups instead to constrain access to
>> > GPUs.
>> > Then it doesn't matter what you set CUDA_VISIBLE_DEVICES to, as
>> > processes will only be able to access what they requested.
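>> >
>> > With ConstrainDevices=yes you can check this from inside an allocation,
>> > e.g. (partition name and output illustrative):
>> >
>> > $ srun --partition=gpu --gres=gpu:1 nvidia-smi -L
>> > GPU 0: Tesla ... (UUID: GPU-...)
>> >
>> > Only the allocated device shows up, whatever CUDA_VISIBLE_DEVICES says.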
>> >
>> > Hope that helps!
>> > Chris
>> > --
>> > Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
>> >
>> >
>> >
>> >
>> >
>>
>>