<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Slurm is trying to kill the job that is exceeding its time
limit, but the job doesn't die, so Slurm sets the node to DRAINING
with the reason "Kill task failed", because it sees this as a
problem with the node. Increasing the value for GraceTime or
KillWait might help:</p>
<p><br>
</p>
<p>
<blockquote type="cite"><dt><b>GraceTime</b></dt>
<dd>
Specifies, in units of seconds, the preemption grace time
to be extended to a job which has been selected for
preemption.
The default value is zero, no preemption grace time is allowed
on
this partition.
Once a job has been selected for preemption, its end time is
set to the current
time plus GraceTime. The job's tasks are immediately sent
SIGCONT and SIGTERM
signals in order to provide notification of its imminent
termination.
This is followed by the SIGCONT, SIGTERM and SIGKILL signal
sequence upon
reaching its new end time. This second set of signals is sent
to both the
tasks <b>and</b> the containing batch script, if applicable.
Meaningful only for PreemptMode=CANCEL.
See also the global <b>KillWait</b> configuration parameter.
</dd>
</blockquote>
<br>
</p>
<blockquote type="cite"><dt><b>KillWait</b></dt>
<dd>
The interval, in seconds, given to a job's processes between the
SIGTERM and SIGKILL signals upon reaching its time limit.
If the job fails to terminate gracefully in the interval
specified,
it will be forcibly terminated.
The default value is 30 seconds.
The value may not exceed 65533.
</dd>
</blockquote>
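<p>As a rough sketch of where these settings live (going by the
descriptions quoted above, KillWait is a global parameter, while
GraceTime is set per partition and only matters for preemption),
the slurm.conf change might look something like this. The values
and the partition name "batch" are placeholders I made up, not
recommendations:</p>
<pre># slurm.conf (illustrative values only)

# Global: seconds between SIGTERM and SIGKILL at the time limit (default 30)
KillWait=120

# Per partition: preemption grace time in seconds (default 0).
# "batch" is a hypothetical partition name; rn003 is the node from the log.
PartitionName=batch Nodes=rn003 GraceTime=120
</pre>
<p>After editing the file, "scontrol reconfigure" should make the
daemons reread it, and "scontrol show config | grep KillWait"
shows the value currently in effect.</p>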
<p><br>
</p>
<p>--<br>
Prentice<br>
</p>
<br>
<div class="moz-cite-prefix">On 3/19/19 7:21 AM, Taras Shapovalov
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAJr6v3GYRpvWmTGVSR1vKX9zvXTVtvkebDK+CD3MsVD1yH+KKw@mail.gmail.com">
<div dir="ltr">Hey guys,
<div><br>
</div>
<div>When a job max time is exceeded, then Slurm tries to kill
the job and fails:<br>
</div>
<div><br>
</div>
<div><span
style="color:rgb(23,43,77);font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen,Ubuntu,'Fira Sans','Droid Sans','Helvetica Neue',sans-serif;font-size:14px">[2019-03-15T09:44:03.589]
sched: _slurm_rpc_allocate_resources JobId=1325
NodeList=rn003 usec=355<br>
[2019-03-15T09:44:03.928]
prolog_running_decr: Configuration for JobID=1325 is
complete<br>
[2019-03-15T09:45:12.739]
Time limit exhausted for JobId=1325<br>
[2019-03-15T09:45:44.001]
_slurm_rpc_complete_job_allocation: JobID=1325 State=0x8006
NodeCnt=1 error Job/step already completing or completed<br>
[2019-03-15T09:46:12.805]
Resending TERMINATE_JOB request JobId=1325 Nodelist=rn003<br>
[2019-03-15T09:48:43.000]
update_node: node rn003 reason set to: Kill task failed<br>
[2019-03-15T09:48:43.000]
update_node: node rn003 state set to DRAINING<br>
[2019-03-15T09:48:43.000]
got (nil)<br>
[2019-03-15T09:48:43.816]
cleanup_completing: job 1325 completion process took 211
seconds</span><br>
</div>
<div><br>
</div>
<div><span
style="color:rgb(23,43,77);font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen,Ubuntu,'Fira Sans','Droid Sans','Helvetica Neue',sans-serif;font-size:14px">This happens even on
very simple "srun bash" jobs that exceed their time limits.
Do you have any idea what this means? Upgrading to the latest
version did not help.</span></div>
<div><br>
</div>
<div><span
style="color:rgb(23,43,77);font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen,Ubuntu,'Fira Sans','Droid Sans','Helvetica Neue',sans-serif;font-size:14px">Best regards,</span></div>
<div><span
style="color:rgb(23,43,77);font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen,Ubuntu,'Fira Sans','Droid Sans','Helvetica Neue',sans-serif;font-size:14px">Taras</span></div>
</div>
</blockquote>
</body>
</html>