<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
font-size:10.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
span.EmailStyle18
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="en-SA" link="blue" vlink="purple" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt">Hi John, Mark,<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt">We use a spank plugin
<a href="https://gitlab.com/greg.wickham/slurm-spank-private-tmpdir">https://gitlab.com/greg.wickham/slurm-spank-private-tmpdir</a> (this was derived from other authors but modified for functionality required on site).<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt">It can bind tmpfs mount points to the users cgroup allocation, additionally bind options can be provided (ie: limit memory by size, limit memory by % as supported by tmpfs(5))<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt">More information is in the README.md<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt"> -Greg<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt">On 05/04/2022, 23:17, "slurm-users" <slurm-users-bounces@lists.schedmd.com> wrote:<o:p></o:p></span></p>
</div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt">I've thought-experimented this in the past, wanting to do the same thing but haven't found any way to get a/dev/shm or a tmpfs into a job's cgroups to be accounted against the job's
allocation. The best I have come up with is creating a per-job tmpfs from a prolog, removing from epilog and setting its size to be some amount of memory that at least puts some restriction on how much damage the job could do. Another alternative is to only
allow access to a memory filesystem if the job request is exclusive and takes the whole node. Crude, but effective at least to the point of preventing one job from killing others. If you happen to find a real solution, please post it :) <o:p></o:p></span></p>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt">griznog<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<div>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt">On Mon, Apr 4, 2022 at 10:19 AM Mark Coatsworth <<a href="mailto:mark.coatsworth@vectorinstitute.ai">mark.coatsworth@vectorinstitute.ai</a>> wrote:<o:p></o:p></span></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0cm 0cm 0cm 6.0pt;margin-left:4.8pt;margin-right:0cm">
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt">Hi all,<o:p></o:p></span></p>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt">We have a GPU cluster (Slurm 19.05.3) that typically runs large PyTorch jobs dependent on shared memory (/dev/shm). When our machines get busy, we often run into a problem where
one job exhausts all the shared memory on a system, causing any other jobs landing there to fail immediately.<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt">We're trying to figure out a good way to manage this resource. I know that Slurm counts shared memory as part of a job's total memory allocation, so we could use cgroups to OOM kill
jobs that exceed this. But that doesn't prevent a user from just making a large request and exhausting it all anyway.<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt">Does anybody have any thoughts or experience with setting real limits on shared memory, and either swapping it out or killing the job if this gets exceeded? One thought we had was
to use a new generic resource (GRES). This is pretty easy to add in the configuration, but seems like it would be a huge task to write a plugin that actually enforces it.<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt">Is this something where the Job Container plugin might be useful?<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt">Any thoughts or suggestions would be appreciated,<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:11.0pt">Mark<o:p></o:p></span></p>
</div>
</div>
</blockquote>
</div>
</div>
</body>
</html>