<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p><tt>and to answer "CUDA_VISBLE_DEVICES can't be set NoDevFiles in
Slurm 17.11.7"</tt></p>
<p><tt>CUDA_VISIBLE_DEVICES is unset if --gres=none and if set in the
user's environment, it will remains set to whatever. If you
want really want to see NoDevFIles, set it in /etc/profile.d, it
will get clobbered when the resources are actually there.</tt><tt><br>
</tt></p>
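<p><tt>For example, a minimal /etc/profile.d sketch (the file name is
illustrative, not from this thread) that gives every shell a NoDevFiles
default, which a real GPU allocation will then overwrite:</tt></p>
<pre wrap=""># /etc/profile.d/cuda_default.sh -- illustrative name
# Default to no visible CUDA devices; Slurm replaces this value
# in the job environment when GPUs are actually requested via --gres=gpu:N.
if [ -z "$CUDA_VISIBLE_DEVICES" ]; then
    export CUDA_VISIBLE_DEVICES=NoDevFiles
fi
</pre>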
<p><tt><br>
</tt></p>
<p><tt>$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=none
-p GPU /usr/bin/env |grep CUDA</tt><tt><br>
</tt>
<tt><b>CUDA_VISIBLE_DEVICES=0,1</b></tt><tt><br>
</tt><tt>$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1
--gres=none -p GPU nvidia-smi</tt><tt><br>
</tt>
<tt><b>No devices were found</b></tt><tt><br>
</tt>
</p>
<p>
<tt><br>
</tt></p>
<tt>$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:1
-p GPU /usr/bin/env |grep CUDA</tt><tt><b><br>
</b></tt><tt><b>CUDA_VISIBLE_DEVICES=0</b></tt><br>
<tt>$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:1
-p GPU nvidia-smi |grep Tesla | wc</tt><br>
<tt><b> 1 11 80</b></tt><tt><br>
</tt>
<tt>$ </tt><br>
<br>
<br>
<p><tt>$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1
--gres=gpu:2 -p GPU /usr/bin/env |grep CUDA</tt><tt><br>
</tt>
<tt><b>CUDA_VISIBLE_DEVICES=0,1</b></tt><tt><br>
</tt>
<tt>$ export CUDA_VISIBLE_DEVICES=0,1; srun -N 1 -n 1 --gres=gpu:2
-p GPU nvidia-smi |grep Tesla | wc</tt><tt><br>
</tt><tt><b> 2 22 160</b></tt><tt><br>
</tt>
<tt>$ </tt>
</p>
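<p><tt>The same behavior applies under sbatch; a small batch-script sketch
(partition and gres names taken from the examples above):</tt></p>
<pre wrap="">#!/bin/bash
#SBATCH -N 1 -n 1
#SBATCH --gres=gpu:2
#SBATCH -p GPU
# Inside the job, Slurm sets CUDA_VISIBLE_DEVICES to the allocated GPUs.
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi -L
</pre>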
<p><tt><br>
</tt>
</p>
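<p><tt>To enforce GPU isolation regardless of what the environment variable
says (as Chris suggests in the quoted thread below), device constraint can
be enabled in cgroup.conf; a minimal sketch, to be adjusted per site:</tt></p>
<pre wrap=""># /etc/slurm/cgroup.conf -- minimal sketch, not this cluster's actual config
CgroupAutomount=yes
ConstrainDevices=yes    # jobs only see the GPU device files they were allocated
ConstrainCores=yes
ConstrainRAMSpace=yes
</pre>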
<br>
<div class="moz-cite-prefix">On 08/30/2018 10:48 AM, Renfro, Michael
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:09D17229-F44C-4B5E-B251-4DBE87B2F630@tntech.edu">
<pre wrap="">Chris’ method will set CUDA_VISIBLE_DEVICES like you’re used to, and it will help keep you or your users from picking conflicting devices.
My cgroup/GPU settings from slurm.conf:
=====
[renfro@login ~]$ egrep -i '(cgroup|gpu)' /etc/slurm/slurm.conf | grep -v '^#'
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
NodeName=gpunode[001-004] CoresPerSocket=14 RealMemory=126000 Sockets=2 ThreadsPerCore=1 Gres=gpu:2
PartitionName=gpu Default=NO MinNodes=1 DefaultTime=1-00:00:00 MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
PartitionName=gpu-debug Default=NO MinNodes=1 MaxTime=00:30:00 AllowGroups=ALL PriorityJobFactor=2 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
PartitionName=gpu-interactive Default=NO MinNodes=1 MaxNodes=2 MaxTime=02:00:00 AllowGroups=ALL PriorityJobFactor=3 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=4000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
GresTypes=gpu,mic
=====
Example (where hpcshell is a function that runs “srun --pty $SHELL -I”), with no CUDA_VISIBLE_DEVICES on the submit host, but it is correctly set once GPUs are reserved:
=====
[renfro@login ~]$ echo $CUDA_VISIBLE_DEVICES
[renfro@login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:1
[renfro@gpunode003 ~]$ echo $CUDA_VISIBLE_DEVICES
0
[renfro@login ~]$ hpcshell --partition=gpu-interactive --gres=gpu:2
[renfro@gpunode004 ~]$ echo $CUDA_VISIBLE_DEVICES
0,1
=====
</pre>
<blockquote type="cite">
<pre wrap="">On Aug 30, 2018, at 4:18 AM, Chaofeng Zhang <a class="moz-txt-link-rfc2396E" href="mailto:zhangcf1@lenovo.com"><zhangcf1@lenovo.com></a> wrote:
CUDA_VISIBLE_DEVICES is used by many AI frameworks, like TensorFlow, to determine which GPU to use. So this environment variable is critical to us.
-----Original Message-----
From: slurm-users <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users-bounces@lists.schedmd.com"><slurm-users-bounces@lists.schedmd.com></a> On Behalf Of Chris Samuel
Sent: Thursday, August 30, 2018 4:42 PM
To: <a class="moz-txt-link-abbreviated" href="mailto:slurm-users@lists.schedmd.com">slurm-users@lists.schedmd.com</a>
Subject: [External] Re: [slurm-users] serious bug about CUDA_VISBLE_DEVICES in the slurm 17.11.7
On Thursday, 30 August 2018 6:38:08 PM AEST Chaofeng Zhang wrote:
</pre>
<blockquote type="cite">
<pre wrap="">The CUDA_VISBLE_DEVICES can't be set NoDevFiles in Slurm 17.11.7.
This is worked when we use Slurm 17.02.
</pre>
</blockquote>
<pre wrap="">
You probably should be using cgroups instead to constrain access to GPUs.
Then it doesn't matter what you set CUDA_VISIBLE_DEVICES to, as processes will only be able to access what they requested.
Hope that helps!
Chris
--
Chris Samuel : <a class="moz-txt-link-freetext" href="http://www.csamuel.org/">http://www.csamuel.org/</a> : Melbourne, VIC
</pre>
</blockquote>
<pre wrap="">
</pre>
</blockquote>
<br>
</body>
</html>