<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8"><meta name=Generator content="Microsoft Word 15 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:DokChampa;
        panose-1:2 11 6 4 2 2 2 2 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0cm;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;
        mso-fareast-language:EN-US;}
span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Calibri",sans-serif;
        color:windowtext;}
.MsoChpDefault
        {mso-style-type:export-only;
        mso-fareast-language:EN-US;}
@page WordSection1
        {size:612.0pt 792.0pt;
        margin:70.85pt 70.85pt 2.0cm 70.85pt;}
div.WordSection1
        {page:WordSection1;}
/* List Definitions */
@list l0
        {mso-list-id:584149631;
        mso-list-template-ids:-1412669512;}
@list l1
        {mso-list-id:724333469;
        mso-list-template-ids:689577328;}
ol
        {margin-bottom:0cm;}
ul
        {margin-bottom:0cm;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-GB link="#0563C1" vlink="#954F72" style='word-wrap:break-word'><div class=WordSection1><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'>Dear Slurm users,<o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'> <o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'>I am looking for a Slurm setting that will kill a job immediately when any subprocess of that job hits an OOM limit. Several posts have touched on this, e.g. <a href="https://www.mail-archive.com/slurm-users@lists.schedmd.com/msg04091.html" target="_blank"><span style='color:#0563C1'>https://www.mail-archive.com/slurm-users@lists.schedmd.com/msg04091.html</span></a>, <a href="https://www.mail-archive.com/slurm-users@lists.schedmd.com/msg04190.html" target="_blank"><span style='color:#0563C1'>https://www.mail-archive.com/slurm-users@lists.schedmd.com/msg04190.html</span></a> and <a href="https://bugs.schedmd.com/show_bug.cgi?id=3216" target="_blank"><span style='color:#0563C1'>https://bugs.schedmd.com/show_bug.cgi?id=3216</span></a>, but I cannot find an answer that works in our setting.<o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'> <o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'>The two options I have found are:<o:p></o:p></span></p><ol style='margin-top:0cm' start=1 type=1><li class=MsoNormal style='color:#222222;mso-list:l1 level1 lfo2;background:white'><span style='mso-fareast-language:EN-GB'>Set the shebang to </span><span style='font-family:"Courier New";mso-fareast-language:EN-GB'>#!/bin/bash -e</span><span style='mso-fareast-language:EN-GB'>, which we don’t want to do as we’d need to change this for hundreds of scripts 
from another cluster where we had a different scheduler, and it would also kill tasks on other runtime errors (e.g. if one command in the script doesn’t find a file).</span><span style='font-family:"Arial",sans-serif;mso-fareast-language:EN-GB'><o:p></o:p></span></li><li class=MsoNormal style='color:#222222;mso-list:l1 level1 lfo2;background:white'><span style='mso-fareast-language:EN-GB'>Set </span><span style='font-family:"Courier New";mso-fareast-language:EN-GB'>KillOnBadExit=1</span><span style='mso-fareast-language:EN-GB'>. I am puzzled by this one. This setting is supposed to be overridden by srun’s -K option. Using the example below, </span><span style='font-family:"Courier New";mso-fareast-language:EN-GB'>srun -K --mem=1G ./multalloc.sh</span><span style='mso-fareast-language:EN-GB'> would be expected to kill the job at the first OOM. But it doesn’t, and it still reports 3 oom-kill events. So, can this setting work for our case at all?</span><span style='font-family:"Arial",sans-serif;mso-fareast-language:EN-GB'><o:p></o:p></span></li></ol><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'> <o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'>The reason we want this is that we have scripts that execute programs in loops. These programs are slow and memory-intensive. When the first one crashes due to OOM, the following iterations crash as well. 
In the current setup, we are wasting days executing loops where every iteration crashes after an hour or so due to OOM.<o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'> <o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'>We are using cgroups (and we want to keep them) with the following config:<o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>CgroupAutomount=yes</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>ConstrainCores=yes</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>ConstrainDevices=yes</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>ConstrainKmemSpace=no</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>ConstrainRAMSpace=yes</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>ConstrainSwapSpace=yes</span><span 
style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>MaxSwapPercent=10</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>TaskAffinity=no</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'> <o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'>Relevant bits from slurm.conf:<o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>SelectType=select/cons_tres</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>GresTypes=gpu,mps,bandwidth</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'> <o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'> <o:p></o:p></span></p><p class=MsoNormal 
style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'>A very simple example:<o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>#!/bin/bash</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'># multalloc.sh – each ./alloc8Gb call runs a very simple C++ program that allocates an 8 GB vector and fills it with random floats</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>echo one</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>./alloc8Gb</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>echo two</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>./alloc8Gb</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>echo three</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal 
style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>./alloc8Gb</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>echo done.</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'> <o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'>This is submitted as follows:<o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'> <o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>sbatch --mem=1G ./multalloc.sh</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'> <o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'>The log is:<o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>one</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>./multalloc.sh: line 4: 231155 Killed                  ./alloc8Gb</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span 
style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>two</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>./multalloc.sh: line 6: 231181 Killed                  ./alloc8Gb</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>three</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>./multalloc.sh: line 8: 231263 Killed                  ./alloc8Gb</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>done.</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'>slurmstepd: error: Detected 3 oom-kill event(s) in StepId=3130111.batch cgroup. 
Some of your processes may have been killed by the cgroup out-of-memory handler.</span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='font-family:"Courier New";color:#222222;mso-fareast-language:EN-GB'> </span><span style='font-family:"Arial",sans-serif;color:#222222;mso-fareast-language:EN-GB'><o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'>I would expect the job to be killed for OOM right before “two” is printed.<o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'> <o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'>Any help would be appreciated.<o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'> <o:p></o:p></span></p><p class=MsoNormal style='background:white'><span style='color:#222222;mso-fareast-language:EN-GB'>Best regards,<o:p></o:p></span></p><p class=MsoNormal><span style='color:#888888;background:white;mso-fareast-language:EN-GB'> <o:p></o:p></span></p><p class=MsoNormal><span style='color:#888888;background:white;mso-fareast-language:EN-GB'>Arthur<o:p></o:p></span></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><span style='mso-fareast-language:EN-GB'>-------------------------------------------------------------<o:p></o:p></span></p><p class=MsoNormal><span style='mso-fareast-language:EN-GB'>Dr. 
Arthur Gilly<o:p></o:p></span></p><p class=MsoNormal><span style='mso-fareast-language:EN-GB'>Head of Analytics<o:p></o:p></span></p><p class=MsoNormal><span style='mso-fareast-language:EN-GB'>Institute of Translational Genomics<o:p></o:p></span></p><p class=MsoNormal><span style='mso-fareast-language:EN-GB'>Helmholtz-Centre Munich (HMGU)<o:p></o:p></span></p><p class=MsoNormal><span style='mso-fareast-language:EN-GB'>-------------------------------------------------------------<o:p></o:p></span></p><p class=MsoNormal><o:p> </o:p></p></div>
<br></body></html>