[slurm-users] Users can't scancel
William Markuske
wmarkuske at sdsc.edu
Wed Nov 18 18:33:13 UTC 2020
The epilog script does have exit 0 set at the end, and epilogs exit cleanly
when run.
With the log level set to debug5, I get the following output for any scancel call.
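For reference, the tail of the epilog and the way the extra logging was enabled
look roughly like this; this is a sketch, not the exact contents of the
Bright-managed script at /cm/local/apps/cmd/scripts/epilog:

    #!/bin/bash
    # epilog (sketch) -- site cleanup steps run here
    # ...
    exit 0    # explicit zero return, as noted above

    # slurm.conf -- raise daemon log levels, then push the change out
    SlurmctldDebug=debug5
    SlurmdDebug=debug5
    # apply with: scontrol reconfigure
    # (the slurmctld level can also be changed on the fly: scontrol setdebug debug5)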
Submit host slurmctld.log
> [2020-11-18T10:19:34.944] _slurm_rpc_submit_batch_job: JobId=110
> InitPrio=110503 usec=191
> [2020-11-18T10:19:35.231] debug: sched: Running job scheduler
> [2020-11-18T10:19:35.231] sched: Allocate JobId=110
> NodeList=martin-cn-01 #CPUs=1 Partition=martin-compute
> [2020-11-18T10:19:35.387] prolog_running_decr: Configuration for
> JobId=110 is complete
> [2020-11-18T10:19:39.769] _slurm_rpc_kill_job: REQUEST_KILL_JOB
> JobId=110 uid 0
> [2020-11-18T10:19:39.770] job_signal: 9 of running JobId=110
> successful 0x8004
> [2020-11-18T10:20:59.282] Resending TERMINATE_JOB request JobId=110
> Nodelist=martin-cn-01
> [2020-11-18T10:21:10.000] update_node: node martin-cn-01 reason set
> to: Kill task failed
> [2020-11-18T10:21:10.000] update_node: node martin-cn-01 state set to
> DRAINING
> [2020-11-18T10:21:10.001] error: slurmd error running JobId=110 on
> node(s)=martin-cn-01: Kill task failed
> [2020-11-18T10:21:10.124] cleanup_completing: JobId=110 completion
> process took 91 seconds
Compute node slurmd.log
> [2020-11-18T10:19:35.388] task_p_slurmd_batch_request: 110
> [2020-11-18T10:19:35.388] task/affinity: job 110 CPU input mask for
> node: 0x00000000000000000000000000000003
> [2020-11-18T10:19:35.388] task/affinity: job 110 CPU final HW mask for
> node: 0x00000000000000010000000000000001
> [2020-11-18T10:19:35.388] debug: [job 110] attempting to run prolog
> [/cm/local/apps/cmd/scripts/prolog]
> [2020-11-18T10:19:35.392] _run_prolog: run job script took usec=4085
> [2020-11-18T10:19:35.392] _run_prolog: prolog with lock for job 110
> ran for 0 seconds
> [2020-11-18T10:19:35.393] Launching batch job 110 for UID 0
> [2020-11-18T10:19:35.398] debug: AcctGatherEnergy NONE plugin loaded
> [2020-11-18T10:19:35.398] debug: AcctGatherProfile NONE plugin loaded
> [2020-11-18T10:19:35.398] debug: AcctGatherInterconnect NONE plugin
> loaded
> [2020-11-18T10:19:35.398] debug: AcctGatherFilesystem NONE plugin loaded
> [2020-11-18T10:19:35.398] debug: switch NONE plugin loaded
> [2020-11-18T10:19:35.398] debug: init: Gres GPU plugin loaded
> [2020-11-18T10:19:35.400] [110.batch] Considering each NUMA node as a
> socket
> [2020-11-18T10:19:35.400] [110.batch] debug: CPUs:128 Boards:1
> Sockets:8 CoresPerSocket:8 ThreadsPerCore:2
> [2020-11-18T10:19:35.400] [110.batch] debug: Job accounting gather
> cgroup plugin loaded
> [2020-11-18T10:19:35.401] [110.batch] debug: cont_id hasn't been set
> yet not running poll
> [2020-11-18T10:19:35.401] [110.batch] debug: laying out the 1 tasks
> on 1 hosts martin-cn-01 dist 2
> [2020-11-18T10:19:35.401] [110.batch] debug: Message thread started
> pid = 267029
> [2020-11-18T10:19:35.401] [110.batch] task affinity plugin loaded with
> CPU mask
> 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffffffffffffffffffff
> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
> constraining jobs allocated cores
> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup/memory:
> total:515440M allowed:100%(enforced), swap:0%(enforced),
> max:100%(515440M) max+swap:100%(1030880M) min:30M kmem:100%(515440M
> enforced) min:30M swappiness:100(set)
> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
> constraining jobs allocated memory
> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: unable to
> open
> /cm/shared/apps/slurm/var/etc/flamingo/cgroup_allowed_devices_file.conf:
> No such file or directory
> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
> constraining jobs allocated devices
> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: loaded
> [2020-11-18T10:19:35.402] [110.batch] debug: Checkpoint plugin
> loaded: checkpoint/none
> [2020-11-18T10:19:35.402] [110.batch] Munge credential signature
> plugin loaded
> [2020-11-18T10:19:35.402] [110.batch] debug: job_container none
> plugin loaded
> [2020-11-18T10:19:35.402] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/freezer/slurm' already exists
> [2020-11-18T10:19:35.402] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/freezer/slurm/uid_0' already exists
> [2020-11-18T10:19:35.403] [110.batch] debug: spank: opening plugin
> stack /etc/slurm/plugstack.conf
> [2020-11-18T10:19:35.403] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/cpuset/slurm' already exists
> [2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: job
> abstract cores are '0'
> [2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: step
> abstract cores are '0'
> [2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: job
> physical cores are '0,64'
> [2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: step
> physical cores are '0,64'
> [2020-11-18T10:19:35.403] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/cpuset/slurm/uid_0' already exists
> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm' already exists
> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm/uid_0' already exists
> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists
> [2020-11-18T10:19:35.406] [110.batch] task/cgroup:
> /slurm/uid_0/job_110: alloc=4096MB mem.limit=4096MB memsw.limit=4096MB
> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already
> exists
> [2020-11-18T10:19:35.406] [110.batch] task/cgroup:
> /slurm/uid_0/job_110/step_batch: alloc=4096MB mem.limit=4096MB
> memsw.limit=4096MB
> [2020-11-18T10:19:35.406] [110.batch] debug: _oom_event_monitor: started.
> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/devices/slurm' already exists
> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/devices/slurm/uid_0' already exists
> [2020-11-18T10:19:35.438] [110.batch] debug level = 2
> [2020-11-18T10:19:35.438] [110.batch] starting 1 tasks
> [2020-11-18T10:19:35.438] [110.batch] task 0 (267034) started
> 2020-11-18T10:19:35
> [2020-11-18T10:19:35.438] [110.batch] debug: Setting slurmstepd
> oom_adj to -1000
> [2020-11-18T10:19:35.438] [110.batch] debug:
> jobacct_gather_cgroup_cpuacct_attach_task: jobid 110 stepid 4294967294
> taskid 0 max_task_id 0
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/cpuacct/slurm' already exists
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0' already exists
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110' already exists
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110/step_batch' already
> exists
> [2020-11-18T10:19:35.439] [110.batch] debug:
> jobacct_gather_cgroup_memory_attach_task: jobid 110 stepid 4294967294
> taskid 0 max_task_id 0
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm' already exists
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm/uid_0' already exists
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already
> exists
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch/task_0'
> already exists
> [2020-11-18T10:19:35.440] [110.batch] debug: task_p_pre_launch:
> affinity jobid 110.4294967294, task:0 bind:256
> [2020-11-18T10:19:35.440] [110.batch] task_p_pre_launch: Using
> sched_affinity for tasks
> [2020-11-18T10:19:39.770] debug: _rpc_terminate_job, uid = 450
> [2020-11-18T10:19:39.770] debug: task_p_slurmd_release_resources:
> affinity jobid 110
> [2020-11-18T10:19:39.770] debug: credential for job 110 revoked
> [2020-11-18T10:19:39.771] [110.batch] debug: Handling
> REQUEST_SIGNAL_CONTAINER
> [2020-11-18T10:19:39.771] [110.batch] debug: _handle_signal_container
> for step=110.4294967294 uid=450 signal=18
> [2020-11-18T10:19:39.771] [110.batch] Sent signal 18 to 110.4294967294
> [2020-11-18T10:19:39.771] [110.batch] debug: Handling
> REQUEST_SIGNAL_CONTAINER
> [2020-11-18T10:19:39.771] [110.batch] debug: _handle_signal_container
> for step=110.4294967294 uid=450 signal=15
> [2020-11-18T10:19:39.771] [110.batch] error: *** JOB 110 ON
> martin-cn-01 CANCELLED AT 2020-11-18T10:19:39 ***
> [2020-11-18T10:19:39.771] [110.batch] Sent signal 15 to 110.4294967294
> [2020-11-18T10:19:39.771] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:39.792] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:39.842] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:39.942] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:40.442] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:41.442] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:42.443] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:43.443] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:44.443] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:45.443] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:46.443] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:47.443] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:48.444] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:49.444] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:50.444] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:00.444] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:09.445] [110.batch] debug: Handling
> REQUEST_STEP_TERMINATE
> [2020-11-18T10:20:09.445] [110.batch] debug: _handle_terminate for
> step=110.4294967294 uid=0
> [2020-11-18T10:20:09.445] [110.batch] Sent SIGKILL signal to
> 110.4294967294
> [2020-11-18T10:20:09.445] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:09.465] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:09.515] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:09.616] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:10.116] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:11.116] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:11.116] [110.batch] debug: Handling
> REQUEST_STEP_TERMINATE
> [2020-11-18T10:20:11.116] [110.batch] debug: _handle_terminate for
> step=110.4294967294 uid=0
> [2020-11-18T10:20:11.116] [110.batch] Sent SIGKILL signal to
> 110.4294967294
> [2020-11-18T10:20:12.116] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:12.116] [110.batch] debug: Handling
> REQUEST_STEP_TERMINATE
> [2020-11-18T10:20:12.116] [110.batch] debug: _handle_terminate for
> step=110.4294967294 uid=0
The SIGKILL signals continue until the unkillable-step timeout
(UnkillableStepTimeout) is reached, at which point Slurm closes the job and
leaves the processes running on the node, with the following output (the
relevant slurm.conf settings are sketched after the log):
> [2020-11-18T10:20:59.282] debug: _rpc_terminate_job, uid = 450
> [2020-11-18T10:20:59.282] debug: task_p_slurmd_release_resources:
> affinity jobid 110
> [2020-11-18T10:21:00.121] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:21:00.121] [110.batch] debug: Handling
> REQUEST_STEP_TERMINATE
> [2020-11-18T10:21:00.121] [110.batch] debug: _handle_terminate for
> step=110.4294967294 uid=0
> [2020-11-18T10:21:00.121] [110.batch] Sent SIGKILL signal to
> 110.4294967294
> [2020-11-18T10:21:10.000] [110.batch] error: *** JOB 110 STEPD
> TERMINATED ON martin-cn-01 AT 2020-11-18T10:21:09 DUE TO JOB NOT
> ENDING WITH SIGNALS ***
> [2020-11-18T10:21:10.000] [110.batch] debug: get_exit_code task 0
> killed by cmd
> [2020-11-18T10:21:10.000] [110.batch] job 110 completed with slurm_rc
> = 4001, job_rc = -1
> [2020-11-18T10:21:10.000] [110.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status:-1
> [2020-11-18T10:21:10.001] [110.batch] debug: Message thread exited
> [2020-11-18T10:21:10.001] [110.batch] done with job
> [2020-11-18T10:21:10.121] debug: Waiting for job 110's prolog to complete
> [2020-11-18T10:21:10.121] debug: Finished wait for job 110's prolog
> to complete
> [2020-11-18T10:21:10.121] debug: [job 110] attempting to run epilog
> [/cm/local/apps/cmd/scripts/epilog]
> [2020-11-18T10:21:10.124] debug: completed epilog for jobid 110
> [2020-11-18T10:21:10.124] debug: Job 110: sent epilog complete msg:
> rc = 0
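For completeness, the timeout-and-drain behavior in the log above is governed
by the unkillable-step handling in slurm.conf. A minimal sketch of the relevant
settings (the values shown are illustrative, not necessarily our production
config):

    # slurm.conf -- parameters behind the "Kill task failed" / drain path
    ProctrackType=proctrack/cgroup        # how slurmd tracks and signals job processes
    TaskPlugin=task/affinity,task/cgroup
    UnkillableStepTimeout=60              # seconds to wait after SIGKILL before
                                          # declaring "Kill task failed" and draining
    #UnkillableStepProgram=/path/to/hook  # optional script run when a step is unkillable

    # quick check of the live values on this cluster:
    scontrol show config | grep -Ei 'unkillable|proctrack|taskplugin'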
Willy Markuske
HPC Systems Engineer
Research Data Services
P: (858) 246-5593
On 11/18/20 9:41 AM, mercan wrote:
> Hi;
>
> Check the epilog return value, which comes from the return value of the
> last line of the epilog script. Also, for testing purposes, you can add an
> "exit 0" line as the last line of the epilog script to ensure it returns
> zero.
>
> Ahmet M.
>
>
> On 18.11.2020 20:00, William Markuske wrote:
>>
>> Hello,
>>
>> I am having an odd problem where users are unable to kill their jobs
>> with scancel. Users can submit jobs just fine, and when a task
>> completes it closes correctly. However, if a user attempts
>> to cancel a job via scancel, the SIGKILL signals are sent to the step
>> but never take effect. Slurmd then continues to send SIGKILL requests
>> until the UnkillableStepTimeout is hit, the Slurm job exits with an
>> error, the node enters a draining state, and the spawned processes
>> continue to run on the node.
>>
>> I'm at a loss because jobs can complete without issue, which seems to
>> suggest it's not a networking or permissions problem preventing Slurm
>> from performing its job accounting tasks. A user can ssh to the node once
>> a job is submitted and kill the subprocesses manually, at which point
>> Slurm completes the epilog and the node returns to idle.
>>
>> Does anyone know what may be causing such behavior? Please let me
>> know any slurm.conf or cgroup.conf settings that would be helpful to
>> diagnose this issue. I'm quite stumped by this one.
>>
>> --
>>
>> Willy Markuske
>>
>> HPC Systems Engineer
>>
>>
>>
>> Research Data Services
>>
>> P: (858) 246-5593
>>