[slurm-users] Managing shared memory (/dev/shm) usage per job?

Mark Coatsworth mark.coatsworth at vectorinstitute.ai
Mon Apr 4 15:16:37 UTC 2022


Hi all,

We have a GPU cluster (Slurm 19.05.3) that typically runs large PyTorch
jobs which depend heavily on shared memory (/dev/shm). When our machines get
busy, we often run into a problem where one job exhausts all the shared
memory on a node, causing any other jobs that land there to fail immediately.

We're trying to figure out a good way to manage this resource. I know that
Slurm counts shared memory toward a job's total memory allocation, so we
could use cgroups to OOM-kill jobs that exceed their allocation. But that
doesn't prevent a user from simply requesting a large allocation and
exhausting all of /dev/shm anyway.
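
For reference, this is roughly the cgroup setup I mean -- just a sketch, and
the values are examples rather than what we actually run:

    # slurm.conf -- enable cgroup-based task containment
    TaskPlugin=task/cgroup
    SelectTypeParameters=CR_Core_Memory

    # cgroup.conf -- enforce each job's memory allocation; tmpfs pages
    # written to /dev/shm are charged to the job's memory cgroup, so
    # they count against this limit
    CgroupAutomount=yes
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    AllowedSwapSpace=0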

Does anybody have thoughts or experience with setting hard limits on shared
memory, and either swapping it out or killing the job when the limit is
exceeded? One idea we had was to define a new generic resource (GRES). That
is easy enough to add in the configuration, but it seems like it would be a
huge task to write a plugin that actually enforces it.
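
For what it's worth, declaring the GRES itself looks straightforward -- the
snippet below is only a sketch (the gres name "shm" and the counts are made
up), and as far as I can tell it only buys you scheduling and accounting, not
enforcement:

    # slurm.conf -- declare a per-node "shm" resource, counted in GB
    GresTypes=gpu,shm
    NodeName=gpu[01-16] Gres=gpu:8,shm:64

    # gres.conf on each node -- a count-only resource, no device files
    Name=shm Count=64

    # jobs would then have to request it explicitly, e.g.
    #   sbatch --gres=shm:16 train.sh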

Is this something where the Job Container plugin might be useful?
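
From the docs it looks like job_container/tmpfs gives each job a private
/dev/shm and /tmp that are cleaned up when the job ends, though it only
appeared in 21.08, so it would mean upgrading from 19.05. I'm not certain of
the exact parameters, but the setup appears to be roughly:

    # slurm.conf
    JobContainerType=job_container/tmpfs
    PrologFlags=Contain

    # job_container.conf -- per-job mount namespaces rooted under BasePath
    AutoBasePath=true
    BasePath=/var/tmp/slurm_containers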

Any thoughts or suggestions would be appreciated,

Mark