<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Hello,</p>
<p>I am having an odd problem where users are unable to kill their
jobs with scancel. Users can submit jobs just fine, and when a
task completes on its own the job closes out correctly. However, if a
user attempts to cancel a job via scancel, the SIGKILL signals are sent
to the step but the processes do not terminate. Slurmd then continues to
send SIGKILL until UnkillableStepTimeout is hit, at which point the Slurm
job exits with an error, the node enters a draining state, and the
spawned processes continue to run on the node.</p>
<p>I'm at a loss because jobs can complete without issue, which seems
to suggest it's not a networking or permissions issue preventing
Slurm from doing its job accounting tasks. A user can ssh to the node
once a job is submitted and kill the subprocesses manually, at which
point Slurm completes the epilog and the node returns to idle.</p>
<p>Does anyone know what may be causing this behavior? Please let me
know which slurm.conf or cgroup.conf settings would be helpful to
share for diagnosing this issue. I'm quite stumped by this one.<br>
</p>
<div class="moz-signature">-- <br>
<table cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td width="150" valign="top" height="30" align="left">
<p style="font-size:14px;">Willy Markuske</p>
</td>
</tr>
<tr>
<td style="border-right: 1px solid #000;" align="left">
<p style="font-size:12px;">HPC Systems Engineer</p>
</td>
</tr>
<tr>
<td style="border-right: 1px solid #000;" align="left">
<p style="font-size:12px;">Research Data Services</p>
</td>
</tr>
<tr>
<td style="border-right: 1px solid #000;" align="left">
<p style="font-size:12px;">P: (858) 246-5593</p>
</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>