<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Thanks for the info and link to your bug report. Unfortunately,
my GraceTime is already set to zero for that QOS: <br>
</p>
<pre>$ sacctmgr show qos interruptible format=Name,gracetime
      Name  GraceTime
---------- ----------
interrupt+   00:00:00</pre>
<p><br>
</p>
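<p>For anyone comparing configurations, here is a minimal sketch of how
the active requeue-related settings could be listed from slurm.conf.
The function name show_requeue_settings is made up for illustration
and is not a Slurm tool. It only matches lines that begin with one of
the three keys, so a PreemptMode given inline on a partition line is
not shown.</p>
<pre>
```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print the active, uncommented
# preemption/requeue settings from a slurm.conf file.
show_requeue_settings() {
    # Default path is the one used in this thread; pass another path to override.
    conf="${1:-/etc/slurm/slurm.conf}"
    # Lines starting with '#' cannot match, because the pattern requires the
    # key name to come first (after optional whitespace).
    grep -i -E '^[[:space:]]*(PreemptMode|PreemptType|JobRequeue)[[:space:]]*=' "$conf"
}
```
</pre>
<p>Run against the slurm.conf lines quoted below, this skips the
commented-out #JobRequeue=1 line and prints PreemptMode=Requeue.</p>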
<div class="moz-cite-prefix">On 2/26/21 3:58 PM, Michael Robbert
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:C311A8DA-6454-4065-BC32-B419FF04683D@mines.edu">
<style>@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}@font-face
{font-family:Consolas;
panose-1:2 11 6 9 2 2 4 3 2 4;}p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}pre
{mso-style-priority:99;
mso-style-link:"HTML Preformatted Char";
margin:0in;
font-size:10.0pt;
font-family:"Courier New";}span.sn-widget-textblock-body
{mso-style-name:sn-widget-textblock-body;}span.HTMLPreformattedChar
{mso-style-name:"HTML Preformatted Char";
mso-style-priority:99;
mso-style-link:"HTML Preformatted";
font-family:"Consolas",serif;}span.EmailStyle24
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}div.WordSection1
{page:WordSection1;}</style>
<div class="WordSection1">
<p class="MsoNormal">We saw something that sounds similar to
this. See this bug report: <a
href="https://bugs.schedmd.com/show_bug.cgi?id=10196"
moz-do-not-send="true">https://bugs.schedmd.com/show_bug.cgi?id=10196</a><o:p></o:p></p>
<p class="MsoNormal">SchedMD never found the root cause. They
thought it might have something to do with a timing problem on
Prolog scripts, but the thing that fixed it for us was to set
GraceTime=0 on our preemptable QoS.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<div>
<p class="MsoNormal"><b><span style="color:#002060">Mike
Robbert<o:p></o:p></span></b></p>
<p class="MsoNormal"><b><span style="color:#002060">Cyberinfrastructure
Specialist, Cyberinfrastructure and Advanced
Research Computing<o:p></o:p></span></b></p>
<p class="MsoNormal"><span style="color:#767171">Information
and Technology Solutions (ITS)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="color:#767171">303-273-3786
| </span><a href="mailto:mrobbert@mines.edu"
moz-do-not-send="true"><span style="color:#0563C1">mrobbert@mines.edu</span></a><span
style="color:#767171"> </span><span
style="font-size:12.0pt;color:#767171"> <o:p></o:p></span></p>
<p class="MsoNormal"><b><span style="color:#2B4160">Our
values:</span></b><span style="color:#2B4160"> </span><span
style="color:#767171">Trust | Integrity | Respect |
Responsibility</span><o:p></o:p></p>
</div>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div style="border:none;border-top:solid #B5C4DF
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span
style="font-size:12.0pt;color:black">From: </span></b><span
style="font-size:12.0pt;color:black">slurm-users
<a class="moz-txt-link-rfc2396E" href="mailto:slurm-users-bounces@lists.schedmd.com">&lt;slurm-users-bounces@lists.schedmd.com&gt;</a> on behalf of
Prentice Bisbal <a class="moz-txt-link-rfc2396E" href="mailto:pbisbal@pppl.gov">&lt;pbisbal@pppl.gov&gt;</a><br>
<b>Reply-To: </b>Slurm User Community List
<a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com">&lt;slurm-users@lists.schedmd.com&gt;</a><br>
<b>Date: </b>Friday, February 26, 2021 at 12:38<br>
<b>To: </b><a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com">"slurm-users@lists.schedmd.com"</a>
<a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com">&lt;slurm-users@lists.schedmd.com&gt;</a><br>
<b>Subject: </b>[External] [slurm-users] Preemption not
working in 20.11<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p>We recently upgraded from Slurm 19.05.8 to 20.11.3. In our
configuration, we have a partition named 'interruptible' for
long-running, low-priority jobs that use checkpoint/restart.
Preempted jobs should be killed and requeued rather than
suspended. This configuration had been working without issue
for 2+ years.<o:p></o:p></p>
<p>After the upgrade, this stopped working: preempted jobs are
killed but not requeued. My slurm.conf file is configured to
requeue preempted jobs:<o:p></o:p></p>
<p>$ grep -i requeue /etc/slurm/slurm.conf <br>
#JobRequeue=1<br>
PreemptMode=Requeue<o:p></o:p></p>
<p>And the user's sbatch script included the --requeue option.
<o:p></o:p></p>
<p>The user reports that the error output from his preempted
jobs now says:<o:p></o:p></p>
<p><span class="sn-widget-textblock-body">slurmstepd: error:
*** STEP 1075117.0 ON greene002 CANCELLED AT
2021-02-25T16:07:48 ***</span><o:p></o:p></p>
<p><span class="sn-widget-textblock-body">In the past, it
would say PREEMPTED instead of CANCELLED. </span><br>
<br>
<o:p></o:p></p>
<p><span class="sn-widget-textblock-body">Any ideas what would
cause this? I've reported this to Slurm support, and
haven't gotten anything back yet, so I figured I'd ask
here, too. If this is a bug, I can't be the only one who
has experienced this. </span><br>
<br>
<o:p></o:p></p>
<pre>-- <o:p></o:p></pre>
<pre>Prentice <o:p></o:p></pre>
</div>
</div>
</blockquote>
<pre class="moz-signature" cols="72">--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
<a class="moz-txt-link-freetext" href="http://www.pppl.gov">http://www.pppl.gov</a></pre>
</body>
</html>