<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p><tt>and to answer "CUDA_VISBLE_DEVICES can't be set NoDevFiles in
Slurm 17.11.7"</tt></p>
<p><tt>CUDA_VISIBLE_DEVICES is unset if --gres=none and if set in the
user's environment, it will remains set to whatever. If you
want really want to see NoDevFIles, set it in /etc/profile.d, it
will get clobbered when the resources are actually there.</tt><tt><br>
</tt></p>
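<p><tt>For example, a minimal /etc/profile.d sketch (the file name is
illustrative, not from this thread) that gives every shell a NoDevFiles
default, which a real GPU allocation will then overwrite:</tt></p>
<pre wrap=""># /etc/profile.d/cuda_default.sh -- illustrative name
# Default to no visible CUDA devices; Slurm replaces this value
# in the job environment when GPUs are actually requested via --gres=gpu:N.
if [ -z "$CUDA_VISIBLE_DEVICES" ]; then
    export CUDA_VISIBLE_DEVICES=NoDevFiles
fi
</pre>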
<p><tt><br>
</tt></p>
<p><tt>$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=none
-p GPU /usr/bin/env |grep CUDA</tt><tt><br>
</tt>
<tt><b>CUDA_VISIBLE_DEVICES=0,1</b></tt><tt><br>
</tt><tt>$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1
--gres=none -p GPU nvidia-smi</tt><tt><br>
</tt>
<tt><b>No devices were found</b></tt><tt><br>
</tt>
</p>
<p>
<tt><br>
</tt></p>
<tt>$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:1
-p GPU /usr/bin/env |grep CUDA</tt><tt><b><br>
</b></tt><tt><b>CUDA_VISIBLE_DEVICES=0</b></tt><br>
<tt>$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:1
-p GPU nvidia-smi |grep Tesla | wc</tt><br>
<tt><b> 1 11 80</b></tt><tt><br>
</tt>
<tt>$ </tt><br>
<br>
<br>
<p><tt>$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1
--gres=gpu:2 -p GPU /usr/bin/env |grep CUDA</tt><tt><br>
</tt>
<tt><b>CUDA_VISIBLE_DEVICES=0,1</b></tt><tt><br>
</tt>
<tt>$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:2
-p GPU nvidia-smi |grep Tesla | wc</tt><tt><br>
</tt><tt><b> 2 22 160</b></tt><tt><br>
</tt>
<tt>$ </tt>
</p>
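<p><tt>The same behavior applies under sbatch; a small batch-script sketch
(partition and gres names taken from the examples above):</tt></p>
<pre wrap="">#!/bin/bash
#SBATCH -N 1 -n 1
#SBATCH --gres=gpu:2
#SBATCH -p GPU
# Inside the job, Slurm sets CUDA_VISIBLE_DEVICES to the allocated GPUs.
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi -L
</pre>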
<p><tt><br>
</tt>
</p>
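<p><tt>To enforce GPU isolation regardless of what the environment variable
says (as Chris suggests in the quoted thread below), device constraint can
be enabled in cgroup.conf; a minimal sketch, to be adjusted per site:</tt></p>
<pre wrap=""># /etc/slurm/cgroup.conf -- minimal sketch, not this cluster's actual config
CgroupAutomount=yes
ConstrainDevices=yes    # jobs only see the GPU device files they were allocated
ConstrainCores=yes
ConstrainRAMSpace=yes
</pre>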
<br>
<div class="moz-cite-prefix">On 08/30/2018 10:48 AM, Renfro, Michael
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:09D17229-F44C-4B5E-B251-4DBE87B2F630@tntech.edu">
<pre wrap="">Chris’ method will set CUDA_VISIBLE_DEVICES like you’re used to, and it will help keep you or your users from picking conflicting devices.
My cgroup/GPU settings from slurm.conf:
=====
[renfro@login ~]$ egrep -i '(cgroup|gpu)' /etc/slurm/slurm.conf | grep -v '^#'
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
NodeName=gpunode[001-004] CoresPerSocket=14 RealMemory=126000 Sockets=2 ThreadsPerCore=1 Gres=gpu:2
PartitionName=gpu Default=NO MinNodes=1 DefaultTime=1-00:00:00 MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
PartitionName=gpu-debug Default=NO MinNodes=1 MaxTime=00:30:00 AllowGroups=ALL PriorityJobFactor=2 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
PartitionName=gpu-interactive Default=NO MinNodes=1 MaxNodes=2 MaxTime=02:00:00 AllowGroups=ALL PriorityJobFactor=3 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
GresTypes=gpu,mic
=====
Example (where hpcshell is a function that runs “srun --pty $SHELL -I”), with no CUDA_VISIBLE_DEVICES on the submit host, but it is correctly set once GPUs are reserved:
=====
[renfro@login ~]$ echo $CUDA_VISIBLE_DEVICES
[renfro@login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:1
[renfro@gpunode003 ~]$ echo $CUDA_VISIBLE_DEVICES
0
[renfro@login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:2
[renfro@gpunode004 ~]$ echo $CUDA_VISIBLE_DEVICES
0,1
=====
</pre>
<blockquote type="cite">
<pre wrap="">On Aug 30, 2018, at 4:18 AM, Chaofeng Zhang <a class="moz-txt-link-rfc2396E" href="mailto:zhangcf1@lenovo.com"><zhangcf1@lenovo.com></a> wrote:
CUDA_VISIBLE_DEVICES is used by many AI frameworks, like TensorFlow, to determine which GPU to use. So this environment variable is critical to us.
-----Original Message-----
From: slurm-users <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users-bounces@lists.schedmd.com"><slurm-users-bounces@lists.schedmd.com></a> On Behalf Of Chris Samuel
Sent: Thursday, August 30, 2018 4:42 PM
To: <a class="moz-txt-link-abbreviated" href="mailto:slurm-users@lists.schedmd.com">slurm-users@lists.schedmd.com</a>
Subject: [External] Re: [slurm-users] serious bug about CUDA_VISBLE_DEVICES in the slurm 17.11.7
On Thursday, 30 August 2018 6:38:08 PM AEST Chaofeng Zhang wrote:
</pre>
<blockquote type="cite">
<pre wrap="">The CUDA_VISBLE_DEVICES can't be set NoDevFiles in Slurm 17.11.7.
This is worked when we use Slurm 17.02.
</pre>
</blockquote>
<pre wrap="">
You probably should be using cgroups instead to constrain access to GPUs.
Then it doesn't matter what you set CUDA_VISIBLE_DEVICES to, as processes will only be able to access what they requested.
Hope that helps!
Chris
--
Chris Samuel : <a class="moz-txt-link-freetext" href="http://www.csamuel.org/">http://www.csamuel.org/</a> : Melbourne, VIC
</pre>
</blockquote>
<pre wrap="">
</pre>
</blockquote>
<br>
</body>
</html>