<div dir="ltr"><div>I guess they won't be killed, but having them listed under slurmd.service could cause other issues: any limit that systemd places on the slurmd service will also be applied to the jobs, and probably cumulatively.<br></div><div>Do you use cgroups for Slurm resource management (the TaskPlugin)? If so, jobs showing up under slurmd.service means it is not working properly.</div><div>We have a lot of customization here, so I can't be sure exactly what change you need. We use the default KillMode (control-group) and Delegate=true.</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Nov 30, 2021 at 2:00 PM LEROY Christine 208562 &lt;<a href="mailto:Christine.LEROY2@cea.fr">Christine.LEROY2@cea.fr</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">





<div lang="FR">
<div class="gmail-m_7637291670707332999WordSection1">
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)">Hi,<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)">Thanks for your feedback.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)" lang="EN-US">It seems we are in the 1<sup>st</sup> case. Looking deeper, the SL7 nodes did not hit the problem, thanks to this service configuration (*).</span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)" lang="EN-US">So the solution seems to be to configure KillMode=process, as mentioned there (**): jobs will still be listed by 'systemctl status slurmd.service', but they won’t be killed. Is that right?</span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)" lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)" lang="EN-US">Thanks in advance,<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)" lang="EN-US">Christine<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)" lang="EN-US">(**)<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)" lang="EN-US"><a href="https://slurm.schedmd.com/programmer_guide.html" target="_blank">https://slurm.schedmd.com/programmer_guide.html</a><u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)" lang="EN-US">(*)<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)" lang="EN-US">grep -i killmode /lib/systemd/system/slurmd.service
<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)" lang="EN-US">KillMode=process<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)" lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)" lang="EN-US">Instead of (for ubuntu nodes)<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)" lang="EN-US">KillMode=control-group<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10pt;font-family:"Verdana",sans-serif;color:rgb(31,73,125)" lang="EN-US"><u></u> <u></u></span></p>
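<p class="MsoNormal">A minimal sketch of how KillMode=process could be applied on the Ubuntu nodes through a systemd drop-in rather than by editing the packaged unit file. The drop-in file name is an assumption for illustration, and whether KillMode=process is the right choice depends on the rest of the site setup:</p>
<pre># /etc/systemd/system/slurmd.service.d/override.conf  (hypothetical file name)
[Service]
# On stop or restart, signal only the main slurmd process,
# not every process in the unit's control group
KillMode=process
</pre>
<p class="MsoNormal">After 'systemctl daemon-reload', 'systemctl show slurmd.service -p KillMode' should report the effective value.</p>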
<p class="MsoNormal"><b><span style="font-size:11pt;font-family:"Calibri",sans-serif">From:</span></b><span style="font-size:11pt;font-family:"Calibri",sans-serif"> slurm-users &lt;<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>&gt;
<b>On behalf of</b> Yair Yarom<br>
<b>Sent:</b> Tuesday, November 30, 2021 08:50<br>
<b>To:</b> Slurm User Community List &lt;<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>&gt;<br>
<b>Subject:</b> Re: [slurm-users] WTERMSIG 15<u></u><u></u></span></p>
<p class="MsoNormal"><u></u> <u></u></p>
<div>
<div>
<p class="MsoNormal">Hi,<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">There were two cases where this happened to us as well:<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">1. The systemd slurmd.service wasn't configured properly, so the jobs ran under the slurmd.slice. Restarting slurmd then makes systemd send a signal to all of those processes. You can check whether this is the case with 'systemctl status slurmd.service': the jobs shouldn't be listed there.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">2. Changing the partitions. Jobs here are sent to most partitions by default, so removing partitions, or removing nodes from partitions, might cause the jobs in the affected partitions to be killed.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">HTH,<u></u><u></u></p>
</div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<p class="MsoNormal"><u></u> <u></u></p>
<div>
<div>
<p class="MsoNormal">On Mon, Nov 29, 2021 at 6:46 PM LEROY Christine 208562 <<a href="mailto:Christine.LEROY2@cea.fr" target="_blank">Christine.LEROY2@cea.fr</a>> wrote:<u></u><u></u></p>
</div>
<blockquote style="border-color:currentcolor currentcolor currentcolor rgb(204,204,204);border-style:none none none solid;border-width:medium medium medium 1pt;padding:0cm 0cm 0cm 6pt;margin-left:4.8pt;margin-right:0cm">
<div>
<div>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">Hello all,</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif"> </span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">I made some modifications to my slurm.conf, then restarted slurmctld on the master and slurmd on the nodes.</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">During this process I lost some jobs (*); curiously, all of these jobs were on Ubuntu nodes.</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">Resource consumption for these jobs looked fine (**).</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif"> </span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">Any idea what the problem could be?</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">Thanks in advance</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">Best regards,</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">Christine Leroy</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif"> </span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif"> </span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">(*)</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">[2021-11-29T14:17:09.205] error: Node xxx appears to have a different slurm.conf than the slurmctld.  This could
 cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">[2021-11-29T14:17:10.162] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">[2021-11-29T14:17:42.223] _job_complete: JobId=4546 WTERMSIG 15</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">[2021-11-29T14:17:42.223] _job_complete: JobId=4546 done</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">[2021-11-29T14:17:42.224] _job_complete: JobId=4666 WTERMSIG 15</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">[2021-11-29T14:17:42.224] _job_complete: JobId=4666 done</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">[2021-11-29T14:17:42.236] _job_complete: JobId=4665 WTERMSIG 15</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">[2021-11-29T14:17:42.236] _job_complete: JobId=4665 done</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">[2021-11-29T14:17:46.072] _job_complete: JobId=4533 WTERMSIG 15</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">[2021-11-29T14:17:46.072] _job_complete: JobId=4533 done</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">[2021-11-29T14:17:59.005] _job_complete: JobId=4664 WTERMSIG 15</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">[2021-11-29T14:17:59.005] _job_complete: JobId=4664 done</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">[2021-11-29T14:17:59.006] _job_complete: JobId=4663 WTERMSIG 15</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">[2021-11-29T14:17:59.007] _job_complete: JobId=4663 done</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">[2021-11-29T14:17:59.021] _job_complete: JobId=4539 WTERMSIG 15</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">[2021-11-29T14:17:59.021] _job_complete: JobId=4539 done</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif"> </span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif"> </span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif">(**)</span><u></u><u></u></p>
<p class="MsoNormal"><span style="font-size:11pt;font-family:"Calibri",sans-serif"># sacct --format=JobID,JobName,ReqCPUS,ReqMem,Start,State,CPUTime,MaxRSS | grep -f /tmp/job15</span></p>
<pre>4533              xterm        1       16Gn 2021-11-24T16:31:32     FAILED 4-21:46:14
4533.batch        batch        1       16Gn 2021-11-24T16:31:32  CANCELLED 4-21:46:14   8893664K
4533.extern      extern        1       16Gn 2021-11-24T16:31:32  COMPLETED 4-21:46:11          0
4539              xterm       16       16Gn 2021-11-24T16:34:25     FAILED 78-11:37:04
4539.batch        batch       16       16Gn 2021-11-24T16:34:25  CANCELLED 78-11:37:04  23781384K
4539.extern      extern       16       16Gn 2021-11-24T16:34:25  COMPLETED 78-11:32:48          0
4546              xterm       16       16Gn 2021-11-24T17:17:54     FAILED 77-23:56:48
4546.batch        batch       16       16Gn 2021-11-24T17:17:54  CANCELLED 77-23:56:48  18541468K
4546.extern      extern       16       16Gn 2021-11-24T17:17:54  COMPLETED 77-23:56:00          0
4663              xterm        1       12Gn 2021-11-26T16:51:12     FAILED 2-21:26:47
4663.batch        batch        1       12Gn 2021-11-26T16:51:12  CANCELLED 2-21:26:47   2275232K
4663.extern      extern        1       12Gn 2021-11-26T16:51:12  COMPLETED 2-21:26:34          0
4664              xterm        1       12Gn 2021-11-26T17:13:42     FAILED 2-21:04:17
4664.batch        batch        1       12Gn 2021-11-26T17:13:42  CANCELLED 2-21:04:17   1484036K
4664.extern      extern        1       12Gn 2021-11-26T17:13:42  COMPLETED 2-21:04:17          0
4665              xterm        1        8Gn 2021-11-26T17:18:12     FAILED 2-20:59:30
4665.batch        batch        1        8Gn 2021-11-26T17:18:12  CANCELLED 2-20:59:30   1159140K
4665.extern      extern        1        8Gn 2021-11-26T17:18:12  COMPLETED 2-20:59:27          0
4666              xterm        1        8Gn 2021-11-26T17:22:12     FAILED 2-20:55:30
4666.batch        batch        1        8Gn 2021-11-26T17:22:12  CANCELLED 2-20:55:30   2090708K
4666.extern      extern        1        8Gn 2021-11-26T17:22:12  COMPLETED 2-20:55:27          0
4711              xterm        4        3Gn 2021-11-29T14:47:09     FAILED   00:20:08
4711.batch        batch        4        3Gn 2021-11-29T14:47:09  CANCELLED   00:20:08     37208K
4711.extern      extern        4        3Gn 2021-11-29T14:47:09  COMPLETED   00:20:00          0
4714          deckbuild       10       30Gn 2021-11-29T14:51:46     FAILED   00:05:20
4714.batch        batch       10       30Gn 2021-11-29T14:51:46  CANCELLED   00:05:20      4036K
4714.extern      extern       10       30Gn 2021-11-29T14:51:46  COMPLETED   00:05:10          0
</pre>
</div>
</div>
</blockquote>
</div>
<p class="MsoNormal"><br clear="all">
<br>
-- <u></u><u></u></p>
<div>
<div>
<div>
<pre>  <span style="color:rgb(133,12,27)">/|</span>       |<u></u><u></u></pre>
<pre>  <span style="color:rgb(133,12,27)">\/</span>       | <b><span style="color:rgb(51,88,104)">Yair Yarom </span></b><span style="color:rgb(51,88,104)">| System Group (DevOps)</span><u></u><u></u></pre>
<pre>  <span style="color:rgb(92,181,149)">[]</span>       | <b><span style="color:rgb(51,88,104)">The Rachel and Selim Benin School</span></b><u></u><u></u></pre>
<pre>  <span style="color:rgb(92,181,149)">[]</span> <span style="color:rgb(133,12,27)">/\</span>    | <b><span style="color:rgb(51,88,104)">of Computer Science and Engineering</span></b><u></u><u></u></pre>
<pre>  <span style="color:rgb(92,181,149)">[]</span><span style="color:rgb(0,161,146)">//</span><span style="color:rgb(133,12,27)">\\</span><span style="color:rgb(49,154,184)">/</span>  | <span style="color:rgb(51,88,104)">The Hebrew University of Jerusalem</span><u></u><u></u></pre>
<pre>  <span style="color:rgb(92,181,149)">[</span><span style="color:rgb(1,84,76)">/</span><span style="color:rgb(0,161,146)">/</span>  <span style="color:rgb(41,16,22)">\\</span>  | <span style="color:rgb(51,88,104)">T +972-2-5494522 | F +972-2-5494522</span><u></u><u></u></pre>
<pre>  <span style="color:rgb(1,84,76)">//</span>    <span style="color:rgb(21,122,134)">\</span>  | <span style="color:rgb(51,88,104)"><a href="mailto:irush@cs.huji.ac.il" target="_blank">irush@cs.huji.ac.il</a></span><u></u><u></u></pre>
<pre> <span style="color:rgb(127,130,103)">/</span><span style="color:rgb(1,84,76)">/</span>        |<u></u><u></u></pre>
</div>
</div>
</div>
</div>
</div>

</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">
    <div>
      <pre style="font-family:monospace">  <span style="color:rgb(133,12,27)">/|</span>       |
  <span style="color:rgb(133,12,27)">\/</span>       | <span style="color:rgb(51,88,104);font-weight:bold">Yair Yarom </span><span style="color:rgb(51,88,104)">| System Group (DevOps)</span>
  <span style="color:rgb(92,181,149)">[]</span>       | <span style="color:rgb(51,88,104);font-weight:bold">The Rachel and Selim Benin School</span>
  <span style="color:rgb(92,181,149)">[]</span> <span style="color:rgb(133,12,27)">/\</span>    | <span style="color:rgb(51,88,104);font-weight:bold">of Computer Science and Engineering</span>
  <span style="color:rgb(92,181,149)">[]</span><span style="color:rgb(0,161,146)">//</span><span style="color:rgb(133,12,27)">\</span><span style="color:rgb(133,12,27)">\</span><span style="color:rgb(49,154,184)">/</span>  | <span style="color:rgb(51,88,104)">The Hebrew University of Jerusalem</span>
  <span style="color:rgb(92,181,149)">[</span><span style="color:rgb(1,84,76)">/</span><span style="color:rgb(0,161,146)">/</span>  <span style="color:rgb(41,16,22)">\</span><span style="color:rgb(41,16,22)">\</span>  | <span style="color:rgb(51,88,104)">T +972-2-5494522 | F +972-2-5494522</span>
  <span style="color:rgb(1,84,76)">//</span>    <span style="color:rgb(21,122,134)">\</span>  | <span style="color:rgb(51,88,104)"><a href="mailto:irush@cs.huji.ac.il" target="_blank">irush@cs.huji.ac.il</a></span>
 <span style="color:rgb(127,130,103)">/</span><span style="color:rgb(1,84,76)">/</span>        |
</pre>
    </div>
  

</div></div>
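<div dir="ltr"><div>Regarding the TaskPlugin question at the top of this reply, a minimal sketch of what a cgroup-based setup typically looks like in the Slurm configuration. The values are illustrative assumptions, not taken from either site's actual files:</div>
<pre style="font-family:monospace"># slurm.conf (fragment)
# Track job processes and confine tasks using cgroups
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf (fragment)
# Restrict jobs to their allocated cores and memory
ConstrainCores=yes
ConstrainRAMSpace=yes
</pre>
<div>With settings along these lines, job processes are expected to live in Slurm's own cgroup hierarchy, so they should not show up in 'systemctl status slurmd.service'.</div></div>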