[slurm-users] Managing shared memory (/dev/shm) usage per job?

John Hanks griznog at gmail.com
Tue Apr 5 20:11:27 UTC 2022


I've run this thought experiment in the past, wanting to do the same thing,
but I haven't found any way to get /dev/shm or a tmpfs into a job's cgroup
so that it's accounted against the job's allocation. The best I have come
up with is creating a per-job tmpfs from a prolog, removing it in an
epilog, and setting its size to some amount of memory that at least limits
how much damage the job can do; a rough sketch is below. Another option is
to only allow access to a memory filesystem if the job request is exclusive
and takes the whole node. Crude, but effective at least to the point of
preventing one job from killing others. If you happen to find a real
solution, please post it :)
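
Something like this is what I have in mind for the Prolog/Epilog side. It's
only a sketch and untested; the mount path, the size cap, and dispatching on
a "prolog"/"epilog" argument are all assumptions you'd adapt to your site:

#!/usr/bin/env python3
"""Sketch of a Prolog/Epilog helper that gives each job its own
size-capped tmpfs. Untested illustration, not a drop-in solution: the
mount path, the size cap, and the "prolog"/"epilog" argument are all
assumptions to adapt to your site."""
import os
import subprocess
import sys

SIZE_CAP = "16G"  # arbitrary per-job cap; pick what one job may reasonably use

def main() -> None:
    mode = sys.argv[1] if len(sys.argv) > 1 else ""
    job_id = os.environ["SLURM_JOB_ID"]     # set by slurmd for Prolog/Epilog
    mount_point = f"/run/shm_job_{job_id}"  # hypothetical per-job path

    if mode == "prolog":
        os.makedirs(mount_point, exist_ok=True)
        # Mount a dedicated tmpfs; size= bounds how much memory it can consume.
        subprocess.run(
            ["mount", "-t", "tmpfs", "-o", f"size={SIZE_CAP}",
             "tmpfs", mount_point],
            check=True,
        )
    elif mode == "epilog":
        # Tear the mount down so its pages are released when the job ends.
        subprocess.run(["umount", mount_point], check=True)
        os.rmdir(mount_point)

if __name__ == "__main__":
    main()

Jobs would then have to be pointed at that path (e.g. via an environment
variable exported from a TaskProlog) instead of writing straight to
/dev/shm, which is the weak spot of this approach.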

griznog

On Mon, Apr 4, 2022 at 10:19 AM Mark Coatsworth <
mark.coatsworth at vectorinstitute.ai> wrote:

> Hi all,
>
> We have a GPU cluster (Slurm 19.05.3) that typically runs large PyTorch
> jobs dependent on shared memory (/dev/shm). When our machines get busy, we
> often run into a problem where one job exhausts all the shared memory on a
> system, causing any other jobs landing there to fail immediately.
>
> We're trying to figure out a good way to manage this resource. I know that
> Slurm counts shared memory as part of a job's total memory allocation, so
> we could use cgroups to OOM kill jobs that exceed this. But that doesn't
> prevent a user from just making a large request and exhausting it all
> anyway.
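> 
> For reference, the setup we have in mind there is roughly
> TaskPlugin=task/cgroup in slurm.conf plus something like this in
> cgroup.conf (a sketch, not our final config):
> 
>   ConstrainRAMSpace=yes
>   ConstrainSwapSpace=yes
>   AllowedSwapSpace=0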
>
> Does anybody have any thoughts or experience with setting real limits on
> shared memory, and either swapping it out or killing the job if this gets
> exceeded? One thought we had was to use a new generic resource (GRES). This
> is pretty easy to add in the configuration, but it seems like it would be
> a huge task to write a plugin that actually enforces it.
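> 
> Roughly what we had in mind on the config side (the "shm_gb" GRES name is
> just a placeholder) is something like:
> 
>   # slurm.conf
>   GresTypes=gpu,shm_gb
>   NodeName=gpu01 Gres=gpu:4,shm_gb:64 ...
> 
>   # gres.conf
>   Name=shm_gb Count=64
> 
> so jobs could request --gres=shm_gb:8, but nothing on the node would
> actually enforce that limit.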
>
> Is this something where the Job Container plugin might be useful?
>
> Any thoughts or suggestions would be appreciated,
>
> Mark
>