<div dir="auto">Thanks Kevin!<div dir="auto"><br></div><div dir="auto">Indeed, nvidia-smi in an interactive job tells me that I can get access to the device when I should not be able to.</div><div dir="auto"><br></div><div dir="auto">I thought including the /dev/nvidia* would whitelist those devices ... which seems to be the opposite of what I want, no?  Or do I misunderstand?</div><div dir="auto"><br></div><div dir="auto">Thanks,</div><div dir="auto">Paul</div></div><br><div class="gmail_quote"><div dir="ltr">On Tue, May 1, 2018, 19:00 Kevin Manalo <<a href="mailto:kmanalo@jhu.edu">kmanalo@jhu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Paul, <br>

<br>

Having recently set this up, this was my test: when you request a single GPU in an interactive run (salloc ... --gres=gpu:1 srun --pty bash), you should see only the GPU assigned to you via 'nvidia-smi'.<br>

<br>

When gres is unset you should see <br>

<br>

nvidia-smi<br>

No devices were found<br>

<br>

Otherwise, if you ask for 1 of 2, you should only see 1 device.<br>
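
The check described above can be scripted. This is a minimal sketch, assuming that `nvidia-smi -L` prints one line per visible device beginning with "GPU N:"; the helper names count_visible_gpus and visible_gpus_in_job are hypothetical and not part of SLURM or the NVIDIA tools.<br>

```python
import re
import subprocess


def count_visible_gpus(listing):
    """Count lines of the form 'GPU 0: ...' in nvidia-smi -L style output."""
    return sum(1 for line in listing.splitlines()
               if re.match(r"GPU \d+:", line.strip()))


def visible_gpus_in_job():
    """Run nvidia-smi -L in the current job and count the devices it lists."""
    result = subprocess.run(["nvidia-smi", "-L"],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    return count_visible_gpus(result.stdout)
```

If device confinement is working as described, visible_gpus_in_job() should return 1 inside a --gres=gpu:1 job and 0 when gres is unset.<br>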

<br>

Also, I recall appending this to the bottom of <br>

<br>

[cgroup_allowed_devices_file.conf]<br>

..<br>

Same as yours<br>

...<br>

/dev/nvidia*<br>

<br>

A SLURM bug report made this clear; the website docs do not say much about it.<br>

<br>

-Kevin<br>

<br>

<br>

On 5/1/18, 5:28 PM, "slurm-users on behalf of R. Paul Wiegand" <<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank" rel="noreferrer">slurm-users-bounces@lists.schedmd.com</a> on behalf of <a href="mailto:rpwiegand@gmail.com" target="_blank" rel="noreferrer">rpwiegand@gmail.com</a>> wrote:<br>

<br>

    Greetings,<br>

<br>

    I am setting up our new GPU cluster, and I seem to have a problem<br>

    configuring things so that the devices are properly walled off via<br>

    cgroups.  Our nodes each have two GPUs; however, if --gres is unset, or<br>

    set to --gres=gpu:0, I can access both GPUs from inside a job.<br>

    Moreover, if I ask for just 1 GPU then unset the CUDA_VISIBLE_DEVICES<br>

    environment variable, I can access both GPUs.  From my<br>

    understanding, this suggests that the devices are *not* being protected under<br>

    cgroups.<br>
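
    CUDA_VISIBLE_DEVICES is only an environment variable that a job can change, so a more direct test is whether the device nodes themselves can be opened. This is a minimal sketch, assuming that opening a device node denied by the devices cgroup fails with an OSError; the helper name openable_devices is hypothetical.<br>

```python
import os


def openable_devices(paths):
    """Return the subset of paths that this process can open read-write."""
    opened = []
    for path in paths:
        try:
            fd = os.open(path, os.O_RDWR)
        except OSError:
            # Denied by the devices cgroup, by file permissions, or missing.
            continue
        os.close(fd)
        opened.append(path)
    return opened
```

    Inside a job, openable_devices(["/dev/nvidia0", "/dev/nvidia1"]) listing both paths when only one GPU (or none) was requested would indicate that the devices are not confined.<br>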

<br>

    I've read the documentation, and I've read through a number of threads<br>

    where people have resolved similar issues.  I've tried a lot of<br>

    configurations, but to no avail. Below I include some snippets of<br>

    relevant (current) parameters; however, I also am attaching most of<br>

    our full conf files.<br>

<br>

    [slurm.conf]<br>

    ProctrackType=proctrack/cgroup<br>

    TaskPlugin=task/cgroup<br>

    SelectType=select/cons_res<br>

    SelectTypeParameters=CR_Core_Memory<br>

    JobAcctGatherType=jobacct_gather/linux<br>

    AccountingStorageTRES=gres/gpu<br>

    GresTypes=gpu<br>

<br>

    NodeName=evc1 CPUs=32 RealMemory=191917 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN NodeAddr=ivc1 Weight=1 Gres=gpu:2<br>

<br>

    [gres.conf]<br>

    NodeName=evc[1-10] Name=gpu File=/dev/nvidia0 COREs=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30<br>

    NodeName=evc[1-10] Name=gpu File=/dev/nvidia1 COREs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31<br>

<br>

    [cgroup.conf]<br>

    ConstrainDevices=yes<br>

<br>

    [cgroup_allowed_devices_file.conf]<br>

    /dev/null<br>

    /dev/urandom<br>

    /dev/zero<br>

    /dev/sda*<br>

    /dev/cpu/*/*<br>

    /dev/pts/*<br>

<br>

    Thanks,<br>

    Paul.<br>

<br>

<br>

</blockquote></div>