[slurm-users] [External] Re: serious bug about CUDA_VISBLE_DEVICES in the slurm 17.11.7
Renfro, Michael
Renfro at tntech.edu
Thu Aug 30 08:48:40 MDT 2018
Chris’ method will set CUDA_VISIBLE_DEVICES like you’re used to, and it will help keep you or your users from picking conflicting devices.
My cgroup/GPU settings from slurm.conf:
=====
[renfro@login ~]$ egrep -i '(cgroup|gpu)' /etc/slurm/slurm.conf | grep -v '^#'
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
NodeName=gpunode[001-004] CoresPerSocket=14 RealMemory=126000 Sockets=2 ThreadsPerCore=1 Gres=gpu:2
PartitionName=gpu Default=NO MinNodes=1 DefaultTime=1-00:00:00 MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
PartitionName=gpu-debug Default=NO MinNodes=1 MaxTime=00:30:00 AllowGroups=ALL PriorityJobFactor=2 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
PartitionName=gpu-interactive Default=NO MinNodes=1 MaxNodes=2 MaxTime=02:00:00 AllowGroups=ALL PriorityJobFactor=3 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
GresTypes=gpu,mic
=====
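The slurm.conf lines above only enable the cgroup plugins; the device constraint itself is switched on in cgroup.conf, and the GPU device files are listed in gres.conf. A minimal sketch of those two files follows. It is not taken from the cluster above, and the /dev/nvidia[0-1] paths are an assumption based on Gres=gpu:2:
=====
# cgroup.conf (sketch): confine each job to the devices it was allocated
ConstrainDevices=yes

# gres.conf (sketch): device paths are assumed, two GPUs per node
NodeName=gpunode[001-004] Name=gpu File=/dev/nvidia[0-1]
=====
With ConstrainDevices=yes, a job can only open the GPU device files it was granted, whatever value CUDA_VISIBLE_DEVICES holds.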
Example, where hpcshell is a function that runs “srun --pty $SHELL -I”. CUDA_VISIBLE_DEVICES is unset on the submit host, but is set correctly once GPUs are reserved:
=====
[renfro@login ~]$ echo $CUDA_VISIBLE_DEVICES
[renfro@login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:1
[renfro@gpunode003 ~]$ echo $CUDA_VISIBLE_DEVICES
0
[renfro@login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:2
[renfro@gpunode004 ~]$ echo $CUDA_VISIBLE_DEVICES
0,1
=====
> On Aug 30, 2018, at 4:18 AM, Chaofeng Zhang <zhangcf1 at lenovo.com> wrote:
>
> CUDA_VISBLE_DEVICES is used by many AI framework to determine which gpu to use, like tensorflow. So this environment is critical to us.
>
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Chris Samuel
> Sent: Thursday, August 30, 2018 4:42 PM
> To: slurm-users at lists.schedmd.com
> Subject: [External] Re: [slurm-users] serious bug about CUDA_VISBLE_DEVICES in the slurm 17.11.7
>
> On Thursday, 30 August 2018 6:38:08 PM AEST Chaofeng Zhang wrote:
>
>> The CUDA_VISBLE_DEVICES can't be set NoDevFiles in Slurm 17.11.7.
>> This is worked when we use Slurm 17.02.
>
> You probably should be using cgroups instead to constrain access to GPUs.
> Then it doesn't matter what you set CUDA_VISBLE_DEVICES to be as processes will only be able to access what they requested.
>
> Hope that helps!
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC