[slurm-users] GRES Restrictions

Fulcomer, Samuel samuel_fulcomer at brown.edu
Tue Aug 25 15:12:56 UTC 2020


cgroups should work correctly _if_ you're not running with an old corrupted
slurm database.

There was a bug in a much earlier version of slurm that corrupted the
database in a way that the cgroups/accounting code could no longer fence
GPUs. This was fixed in a later version, but the database corruption
carries forward.

Apparently the db can be fixed manually, but we're just starting with a new
install and fresh db.
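
For context, the device fencing lives in the cgroup plugin configuration.
A minimal sketch of the pieces involved, assuming NVIDIA device paths and
two GPUs per node (paths and counts are illustrative; check the keywords
against your Slurm version):

    # slurm.conf -- track and confine tasks with cgroups
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup

    # gres.conf -- tell Slurm which device files back each GPU
    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1

    # cgroup.conf -- ConstrainDevices=yes is what actually blocks a job
    # from opening GPU devices it was not allocated
    CgroupAutomount=yes
    ConstrainDevices=yes

With ConstrainDevices=yes in place, a job allocated one GPU gets access to
that device file only, regardless of what CUDA_VISIBLE_DEVICES claims.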

On Tue, Aug 25, 2020 at 11:03 AM Ryan Novosielski <novosirj at rutgers.edu>
wrote:

> Sorry about that. “NJT” should have read “but;” apparently my phone
> decided I was talking about our local transit authority. 😓
>
> On Aug 25, 2020, at 10:30, Ryan Novosielski <novosirj at rutgers.edu> wrote:
>
> I believe that’s done via a QoS on the partition. Have a look at the
> docs there, and I think “require” is a good keyword to look for.
>
> Cgroups should also help with this, NJT I’ve been troubleshooting a
> problem where that seems not to be working correctly.
>
> --
> ____
> || \\UTGERS,
> |---------------------------*O*---------------------------
> ||_// the State     |         Ryan Novosielski - novosirj at rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ     | Office of Advanced Research Computing - MSB C630,
> Newark
>     `'
>
> On Aug 25, 2020, at 10:13, Willy Markuske <wmarkuske at sdsc.edu> wrote:
>
>
> Hello,
>
> I'm trying to restrict access to GPU resources on a cluster I maintain for
> a research group. Two nodes are placed in a partition with GRES GPU
> resources defined. Users can access these resources by submitting their
> jobs to the gpu partition and requesting a gres=gpu.
>
> When a user includes the flag --gres=gpu:#, Slurm properly allocates that
> number of GPUs to them. If a user requests only one GPU, they see only
> that device (CUDA_VISIBLE_DEVICES=1). However, if a user does not include
> the --gres=gpu:# flag, they can still submit a job to the partition and
> are then able to see all the GPUs. This has led to some bad actors running
> jobs on GPUs that other users have been allocated, causing OOM errors on
> those GPUs.
>
> Is it possible to require users to specify --gres=gpu:# in order to
> submit to a partition, and where would I find the documentation on doing
> so? So far, reading the GRES documentation doesn't seem to have yielded
> anything on this issue specifically.
>
> Regards,
> --
>
> Willy Markuske
>
> HPC Systems Engineer
>
> Research Data Services
>
> P: (858) 246-5593
>
>
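
As for requiring a GPU request outright, the QOS route Ryan mentions can be
made concrete with a minimum-TRES limit. A minimal sketch, assuming slurmdbd
accounting is configured; the QOS name "gpu-required" and the node list are
illustrative, and MinTRESPerJob/DenyOnLimit should be checked against your
sacctmgr version:

    # Create a QOS whose floor is one GPU per job; DenyOnLimit should
    # reject non-conforming jobs at submit time rather than leaving
    # them pending
    sacctmgr add qos gpu-required
    sacctmgr modify qos gpu-required set MinTRESPerJob=gres/gpu=1 Flags=DenyOnLimit

    # slurm.conf -- attach the QOS to the partition so the limit
    # applies to every job submitted there
    PartitionName=gpu Nodes=gpu-[01-02] QOS=gpu-required

After a reconfigure, an sbatch to the gpu partition without --gres=gpu:#
should be refused at submission, which closes the hole where such jobs land
on the nodes and see every device. A job_submit/lua plugin that rejects
GPU-partition jobs lacking a GRES request is the other common way to enforce
this.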