<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>The epilog script does have exit 0 set at the end, and epilogs exit
cleanly when run.</p>
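<p>To be clear about what I mean by "exit 0 set at the end": the script ends
with an explicit exit 0 after its cleanup, along the lines of the minimal
sketch below. This is only an illustration (the log path is made up), not the
actual /cm/local/apps/cmd/scripts/epilog.</p>
<pre>
#!/bin/bash
# Minimal epilog sketch. Site-specific cleanup would go here; the final
# line forces a zero return code so slurmd never treats the epilog
# itself as having failed.
echo "$(date) epilog finished for job ${SLURM_JOB_ID}" >> /tmp/epilog-test.log
exit 0
</pre>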
<p>With the log level set to debug5, I get the following results for any
scancel call.</p>
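<p>For completeness, this is roughly how the extra logging can be turned on;
a sketch of the standard knobs only, not necessarily the exact steps used on
this cluster:</p>
<pre>
# Persistent: set both daemons to debug5 in slurm.conf, then push the change.
#   SlurmctldDebug=debug5
#   SlurmdDebug=debug5
scontrol reconfigure

# Controller only, on the fly (reverts at the next restart or reconfigure):
scontrol setdebug debug5
</pre>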
<p>Submit host slurmctld.log:</p>
<blockquote type="cite">[2020-11-18T10:19:34.944]
_slurm_rpc_submit_batch_job: JobId=110 InitPrio=110503 usec=191<br>
[2020-11-18T10:19:35.231] debug: sched: Running job scheduler<br>
[2020-11-18T10:19:35.231] sched: Allocate JobId=110
NodeList=martin-cn-01 #CPUs=1 Partition=martin-compute<br>
[2020-11-18T10:19:35.387] prolog_running_decr: Configuration for
JobId=110 is complete<br>
[2020-11-18T10:19:39.769] _slurm_rpc_kill_job: REQUEST_KILL_JOB
JobId=110 uid 0<br>
[2020-11-18T10:19:39.770] job_signal: 9 of running JobId=110
successful 0x8004</blockquote>
<blockquote type="cite">[2020-11-18T10:20:59.282] Resending
TERMINATE_JOB request JobId=110 Nodelist=martin-cn-01</blockquote>
<blockquote type="cite">[2020-11-18T10:21:10.000] update_node:
node martin-cn-01 reason set to: Kill task failed<br>
[2020-11-18T10:21:10.000] update_node: node martin-cn-01 state
set to DRAINING<br>
[2020-11-18T10:21:10.001] error: slurmd error running JobId=110
on node(s)=martin-cn-01: Kill task failed<br>
[2020-11-18T10:21:10.124] cleanup_completing: JobId=110
completion process took 91 seconds<br>
</blockquote>
<p>Compute node slurmd.log:</p>
<blockquote type="cite">[2020-11-18T10:19:35.388]
task_p_slurmd_batch_request: 110<br>
[2020-11-18T10:19:35.388] task/affinity: job 110 CPU input mask
for node: 0x00000000000000000000000000000003<br>
[2020-11-18T10:19:35.388] task/affinity: job 110 CPU final HW
mask for node: 0x00000000000000010000000000000001<br>
[2020-11-18T10:19:35.388] debug: [job 110] attempting to run
prolog [/cm/local/apps/cmd/scripts/prolog]<br>
[2020-11-18T10:19:35.392] _run_prolog: run job script took
usec=4085<br>
[2020-11-18T10:19:35.392] _run_prolog: prolog with lock for job
110 ran for 0 seconds<br>
[2020-11-18T10:19:35.393] Launching batch job 110 for UID 0<br>
[2020-11-18T10:19:35.398] debug: AcctGatherEnergy NONE plugin
loaded<br>
[2020-11-18T10:19:35.398] debug: AcctGatherProfile NONE plugin
loaded<br>
[2020-11-18T10:19:35.398] debug: AcctGatherInterconnect NONE
plugin loaded<br>
[2020-11-18T10:19:35.398] debug: AcctGatherFilesystem NONE
plugin loaded<br>
[2020-11-18T10:19:35.398] debug: switch NONE plugin loaded<br>
[2020-11-18T10:19:35.398] debug: init: Gres GPU plugin loaded<br>
[2020-11-18T10:19:35.400] [110.batch] Considering each NUMA node
as a socket<br>
[2020-11-18T10:19:35.400] [110.batch] debug: CPUs:128 Boards:1
Sockets:8 CoresPerSocket:8 ThreadsPerCore:2<br>
[2020-11-18T10:19:35.400] [110.batch] debug: Job accounting
gather cgroup plugin loaded<br>
[2020-11-18T10:19:35.401] [110.batch] debug: cont_id hasn't
been set yet not running poll<br>
[2020-11-18T10:19:35.401] [110.batch] debug: laying out the 1
tasks on 1 hosts martin-cn-01 dist 2<br>
[2020-11-18T10:19:35.401] [110.batch] debug: Message thread
started pid = 267029<br>
[2020-11-18T10:19:35.401] [110.batch] task affinity plugin
loaded with CPU mask
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffffffffffffffffffff<br>
[2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
constraining jobs allocated cores<br>
[2020-11-18T10:19:35.402] [110.batch] debug:
task/cgroup/memory: total:515440M allowed:100%(enforced),
swap:0%(enforced), max:100%(515440M) max+swap:100%(1030880M)
min:30M kmem:100%(515440M enforced) min:30M swappiness:100(set)<br>
[2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
constraining jobs allocated memory<br>
[2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup:
unable to open
/cm/shared/apps/slurm/var/etc/flamingo/cgroup_allowed_devices_file.conf:
No such file or directory<br>
[2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
constraining jobs allocated devices<br>
[2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup:
loaded<br>
[2020-11-18T10:19:35.402] [110.batch] debug: Checkpoint plugin
loaded: checkpoint/none<br>
[2020-11-18T10:19:35.402] [110.batch] Munge credential signature
plugin loaded<br>
[2020-11-18T10:19:35.402] [110.batch] debug: job_container none
plugin loaded<br>
[2020-11-18T10:19:35.402] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm'
already exists<br>
[2020-11-18T10:19:35.402] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm/uid_0'
already exists<br>
[2020-11-18T10:19:35.403] [110.batch] debug: spank: opening
plugin stack /etc/slurm/plugstack.conf<br>
[2020-11-18T10:19:35.403] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm'
already exists<br>
[2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: job
abstract cores are '0'<br>
[2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: step
abstract cores are '0'<br>
[2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: job
physical cores are '0,64'<br>
[2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: step
physical cores are '0,64'<br>
[2020-11-18T10:19:35.403] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm/uid_0'
already exists<br>
[2020-11-18T10:19:35.406] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm'
already exists<br>
[2020-11-18T10:19:35.406] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0'
already exists<br>
[2020-11-18T10:19:35.406] [110.batch] debug:
xcgroup_instantiate: cgroup
'/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists<br>
[2020-11-18T10:19:35.406] [110.batch] task/cgroup:
/slurm/uid_0/job_110: alloc=4096MB mem.limit=4096MB
memsw.limit=4096MB<br>
[2020-11-18T10:19:35.406] [110.batch] debug:
xcgroup_instantiate: cgroup
'/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already
exists<br>
[2020-11-18T10:19:35.406] [110.batch] task/cgroup:
/slurm/uid_0/job_110/step_batch: alloc=4096MB mem.limit=4096MB
memsw.limit=4096MB<br>
[2020-11-18T10:19:35.406] [110.batch] debug:
_oom_event_monitor: started.<br>
[2020-11-18T10:19:35.406] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/devices/slurm'
already exists<br>
[2020-11-18T10:19:35.406] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/devices/slurm/uid_0'
already exists<br>
[2020-11-18T10:19:35.438] [110.batch] debug level = 2<br>
[2020-11-18T10:19:35.438] [110.batch] starting 1 tasks<br>
[2020-11-18T10:19:35.438] [110.batch] task 0 (267034) started
2020-11-18T10:19:35<br>
[2020-11-18T10:19:35.438] [110.batch] debug: Setting slurmstepd
oom_adj to -1000<br>
[2020-11-18T10:19:35.438] [110.batch] debug:
jobacct_gather_cgroup_cpuacct_attach_task: jobid 110 stepid
4294967294 taskid 0 max_task_id 0<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm'
already exists<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0'
already exists<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup
'/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110' already exists<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup
'/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110/step_batch' already
exists<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
jobacct_gather_cgroup_memory_attach_task: jobid 110 stepid
4294967294 taskid 0 max_task_id 0<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm'
already exists<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0'
already exists<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup
'/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup
'/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already
exists<br>
[2020-11-18T10:19:35.439] [110.batch] debug:
xcgroup_instantiate: cgroup
'/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch/task_0'
already exists<br>
[2020-11-18T10:19:35.440] [110.batch] debug: task_p_pre_launch:
affinity jobid 110.4294967294, task:0 bind:256<br>
[2020-11-18T10:19:35.440] [110.batch] task_p_pre_launch: Using
sched_affinity for tasks<br>
[2020-11-18T10:19:39.770] debug: _rpc_terminate_job, uid = 450<br>
[2020-11-18T10:19:39.770] debug:
task_p_slurmd_release_resources: affinity jobid 110<br>
[2020-11-18T10:19:39.770] debug: credential for job 110 revoked<br>
[2020-11-18T10:19:39.771] [110.batch] debug: Handling
REQUEST_SIGNAL_CONTAINER<br>
[2020-11-18T10:19:39.771] [110.batch] debug:
_handle_signal_container for step=110.4294967294 uid=450
signal=18<br>
[2020-11-18T10:19:39.771] [110.batch] Sent signal 18 to
110.4294967294<br>
[2020-11-18T10:19:39.771] [110.batch] debug: Handling
REQUEST_SIGNAL_CONTAINER<br>
[2020-11-18T10:19:39.771] [110.batch] debug:
_handle_signal_container for step=110.4294967294 uid=450
signal=15<br>
[2020-11-18T10:19:39.771] [110.batch] error: *** JOB 110 ON
martin-cn-01 CANCELLED AT 2020-11-18T10:19:39 ***<br>
[2020-11-18T10:19:39.771] [110.batch] Sent signal 15 to
110.4294967294<br>
[2020-11-18T10:19:39.771] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:39.792] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:39.842] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:39.942] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:40.442] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:41.442] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:42.443] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:43.443] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:44.443] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:45.443] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:46.443] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:47.443] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:48.444] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:49.444] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:19:50.444] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:00.444] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:09.445] [110.batch] debug: Handling
REQUEST_STEP_TERMINATE<br>
[2020-11-18T10:20:09.445] [110.batch] debug: _handle_terminate
for step=110.4294967294 uid=0<br>
[2020-11-18T10:20:09.445] [110.batch] Sent SIGKILL signal to
110.4294967294<br>
[2020-11-18T10:20:09.445] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:09.465] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:09.515] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:09.616] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:10.116] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:11.116] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:11.116] [110.batch] debug: Handling
REQUEST_STEP_TERMINATE<br>
[2020-11-18T10:20:11.116] [110.batch] debug: _handle_terminate
for step=110.4294967294 uid=0<br>
[2020-11-18T10:20:11.116] [110.batch] Sent SIGKILL signal to
110.4294967294<br>
[2020-11-18T10:20:12.116] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:20:12.116] [110.batch] debug: Handling
REQUEST_STEP_TERMINATE<br>
[2020-11-18T10:20:12.116] [110.batch] debug: _handle_terminate
for step=110.4294967294 uid=0<br>
</blockquote>
<p>SIGKILL signals continue until the unkillable time limit is reached,
at which point Slurm closes the job, leaving the processes running, with
the following output:</p>
<blockquote type="cite">[2020-11-18T10:20:59.282] debug:
_rpc_terminate_job, uid = 450<br>
[2020-11-18T10:20:59.282] debug:
task_p_slurmd_release_resources: affinity jobid 110<br>
[2020-11-18T10:21:00.121] [110.batch] debug: Handling
REQUEST_STATE<br>
[2020-11-18T10:21:00.121] [110.batch] debug: Handling
REQUEST_STEP_TERMINATE<br>
[2020-11-18T10:21:00.121] [110.batch] debug: _handle_terminate
for step=110.4294967294 uid=0<br>
[2020-11-18T10:21:00.121] [110.batch] Sent SIGKILL signal to
110.4294967294<br>
[2020-11-18T10:21:10.000] [110.batch] error: *** JOB 110 STEPD
TERMINATED ON martin-cn-01 AT 2020-11-18T10:21:09 DUE TO JOB NOT
ENDING WITH SIGNALS ***<br>
[2020-11-18T10:21:10.000] [110.batch] debug: get_exit_code task
0 killed by cmd<br>
[2020-11-18T10:21:10.000] [110.batch] job 110 completed with
slurm_rc = 4001, job_rc = -1<br>
[2020-11-18T10:21:10.000] [110.batch] sending
REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status:-1<br>
[2020-11-18T10:21:10.001] [110.batch] debug: Message thread
exited<br>
[2020-11-18T10:21:10.001] [110.batch] done with job<br>
[2020-11-18T10:21:10.121] debug: Waiting for job 110's prolog
to complete<br>
[2020-11-18T10:21:10.121] debug: Finished wait for job 110's
prolog to complete<br>
[2020-11-18T10:21:10.121] debug: [job 110] attempting to run
epilog [/cm/local/apps/cmd/scripts/epilog]<br>
[2020-11-18T10:21:10.124] debug: completed epilog for jobid 110<br>
[2020-11-18T10:21:10.124] debug: Job 110: sent epilog complete
msg: rc = 0<br>
</blockquote>
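<p>For what it's worth, the window before that "JOB NOT ENDING WITH SIGNALS"
message is governed by the unkillable-step settings in slurm.conf. A quick way
to see what is in effect is shown below; the commented values are only an
example of a default install, not this cluster's configuration:</p>
<pre>
# Show the unkillable-step settings the daemons are actually running with.
scontrol show config | grep -i unkillable

# Example output on a default install (not this cluster's values):
#   UnkillableStepProgram   = (null)
#   UnkillableStepTimeout   = 60 sec
</pre>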
<div class="moz-signature">
<p>Willy Markuske<br>
HPC Systems Engineer<br>
Research Data Services<br>
P: (858) 246-5593</p>
</div>
<div class="moz-cite-prefix">On 11/18/20 9:41 AM, mercan wrote:<br>
</div>
<blockquote type="cite"
cite="mid:9f9643ff-bf36-facd-d228-7569d43a5301@uhem.itu.edu.tr">Hi;
<br>
<br>
Check the epilog return value, which comes from the return value of the
last line of the epilog script. Also, you can add an "exit 0" line as
the last line of the epilog script to ensure you get a zero return
value for testing purposes.
<br>
<br>
Ahmet M.
<br>
<br>
<br>
On 18.11.2020 20:00, William Markuske wrote:
<br>
<blockquote type="cite">
<br>
Hello,
<br>
<br>
I am having an odd problem where users are unable to kill their
jobs with scancel. Users can submit jobs just fine, and when the
task completes it closes correctly. However, if a user
attempts to cancel a job via scancel, the SIGKILL signals are
sent to the step but don't take effect. Slurmd then continues to
send SIGKILL requests until the UnkillableStepTimeout is hit, the
Slurm job exits with an error, the node enters a draining
state, and the spawned processes continue to run on the node.
<br>
<br>
I'm at a loss, because jobs can complete without issue, which
seems to suggest it's not a networking or permissions issue
preventing Slurm from doing its job accounting tasks. A user can
ssh to the node once a job is submitted and kill the subprocesses
manually, at which point Slurm completes the epilog and the node
returns to idle.
<br>
<br>
Does anyone know what may be causing such behavior? Please let
me know any slurm.conf or cgroup.conf settings that would be
helpful to diagnose this issue. I'm quite stumped by this one.
<br>
<br>
-- <br>
<br>
Willy Markuske
<br>
<br>
HPC Systems Engineer
<br>
<br>
<br>
<br>
Research Data Services
<br>
<br>
P: (858) 246-5593
<br>
<br>
</blockquote>
</blockquote>
</body>
</html>