<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Thanks Christoph and others for the help.</p>
<p>It turns out the fix was simply the cgroup settings I had mostly
configured months ago; I had even left myself a note to uncomment
ConstrainDevices=yes in cgroup.conf once the GPU systems came
online.</p>
<p>I kept racking my brain over why the gres settings weren't
constraining anything, even though Slurm was setting the number of
requested GPUs correctly.</p>
<p>Everything is working as expected now.<br>
</p>
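<p>For reference, the relevant piece ended up being the cgroup
configuration rather than gres.conf itself. A minimal sketch of the
settings involved (parameter names are standard Slurm options; the
exact set of Constrain* lines you want may differ):</p>
<pre>
# cgroup.conf -- sketch, not the full production file
ConstrainDevices=yes     # confine jobs to the GPU device files they were allocated
ConstrainCores=yes       # optional: also confine jobs to their allocated cores
ConstrainRAMSpace=yes    # optional: also enforce memory limits

# slurm.conf also needs the cgroup task plugin for this to take effect:
#   TaskPlugin=task/cgroup
</pre>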
<div class="moz-signature">
<table cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td width="150" valign="top" height="30" align="left">
<p style="font-size:14px;">Willy Markuske</p>
</td>
</tr>
<tr>
<td style="border-right: 1px solid #000;" align="left">
<p style="font-size:12px;">HPC Systems Engineer</p>
</td>
<td rowspan="3" width="180" valign="center" height="42"
align="center"><tt><img moz-do-not-send="false"
src="cid:part1.81E99EA1.445EFD52@sdsc.edu" alt=""
width="168" height="48"></tt> </td>
</tr>
<tr>
<td style="border-right: 1px solid #000;" align="left">
<p style="font-size:12px;">Research Data Services</p>
</td>
</tr>
<tr>
<td style="border-right: 1px solid #000;" align="left">
<p style="font-size:12px;">P: (858) 246-5593</p>
</td>
</tr>
</tbody>
</table>
<p> </p>
</div>
<div class="moz-cite-prefix">On 8/25/20 8:24 AM, Christoph Brüning
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:f026a81e-0bcf-e3b7-17fb-28ed62eb0bd1@uni-wuerzburg.de">Hello,
<br>
<br>
we're using cgroups to restrict access to the GPUs.
<br>
<br>
What I found particularly helpful are the slides by Marshall
Garey from last year's Slurm User Group Meeting:
<a class="moz-txt-link-freetext" href="https://slurm.schedmd.com/SLUG19/cgroups_and_pam_slurm_adopt.pdf">https://slurm.schedmd.com/SLUG19/cgroups_and_pam_slurm_adopt.pdf</a>
(NVML didn't work for us for some reason I cannot recall, but
listing the GPU device files explicitly was not a big deal)
<br>
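<p>Roughly, an explicit gres.conf entry looks like the following; the
node names, GPU type and device count here are placeholders rather
than our actual configuration:</p>
<pre>
# gres.conf -- explicit device files instead of AutoDetect=nvml
NodeName=gpunode[01-02] Name=gpu Type=v100 File=/dev/nvidia[0-3]
</pre>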
<br>
Best,
<br>
Christoph
<br>
<br>
<br>
On 25/08/2020 16.12, Willy Markuske wrote:
<br>
<blockquote type="cite">Hello,
<br>
<br>
I'm trying to restrict access to GPU resources on a cluster I
maintain for a research group. There are two nodes in a
partition with gres GPU resources defined. Users can access these
resources by submitting their jobs to the gpu partition and
specifying a gres=gpu request.
<br>
<br>
When a user includes the flag --gres=gpu:#, Slurm properly
allocates the requested number of GPUs. If a user requests only
one GPU, they only see CUDA_VISIBLE_DEVICES=1.
However, if a user does not include the --gres=gpu:# flag, they
can still submit a job to the partition and can then see
all the GPUs. This has led to some bad actors running jobs on
GPUs that other users have allocated, causing OOM errors on
those GPUs.
<br>
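<p>Concretely, the behavior looks like this (the partition name is
just illustrative):</p>
<pre>
# GPU requested explicitly: CUDA_VISIBLE_DEVICES is set for the job
$ srun -p gpu --gres=gpu:1 bash -c 'echo $CUDA_VISIBLE_DEVICES; nvidia-smi -L'

# No --gres flag: the job still runs in the partition and
# nvidia-smi lists every GPU in the node
$ srun -p gpu nvidia-smi -L
</pre>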
<br>
Is it possible to require users to specify --gres=gpu:# in order
to submit to a partition, and where would I find the
documentation for doing so? So far, reading the gres
documentation hasn't yielded anything on this issue
specifically.
<br>
<br>
Regards,
<br>
<br>
-- <br>
<br>
Willy Markuske
<br>
<br>
HPC Systems Engineer
<br>
<br>
<br>
<br>
Research Data Services
<br>
<br>
P: (858) 246-5593
<br>
<br>
</blockquote>
<br>
</blockquote>
</body>
</html>