[slurm-users] Users can't scancel
William Markuske
wmarkuske at sdsc.edu
Wed Nov 18 18:33:13 UTC 2020
The epilog script does have exit 0 set at the end, and epilogs exit cleanly
when run.
With the log level set to debug5, I get the following output for any scancel call.
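For reference, the tail of the epilog and the way the extra logging was enabled
look roughly like this; this is a sketch, not the exact contents of the
Bright-managed script at /cm/local/apps/cmd/scripts/epilog:

    #!/bin/bash
    # epilog (sketch) -- site cleanup steps run here
    # ...
    exit 0    # explicit zero return, as noted above

    # slurm.conf -- raise daemon log levels, then push the change out
    SlurmctldDebug=debug5
    SlurmdDebug=debug5
    # apply with: scontrol reconfigure
    # (the slurmctld level can also be changed on the fly: scontrol setdebug debug5)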
Submit host slurmctld.log
> [2020-11-18T10:19:34.944] _slurm_rpc_submit_batch_job: JobId=110
> InitPrio=110503 usec=191
> [2020-11-18T10:19:35.231] debug: sched: Running job scheduler
> [2020-11-18T10:19:35.231] sched: Allocate JobId=110
> NodeList=martin-cn-01 #CPUs=1 Partition=martin-compute
> [2020-11-18T10:19:35.387] prolog_running_decr: Configuration for
> JobId=110 is complete
> [2020-11-18T10:19:39.769] _slurm_rpc_kill_job: REQUEST_KILL_JOB
> JobId=110 uid 0
> [2020-11-18T10:19:39.770] job_signal: 9 of running JobId=110
> successful 0x8004
> [2020-11-18T10:20:59.282] Resending TERMINATE_JOB request JobId=110
> Nodelist=martin-cn-01
> [2020-11-18T10:21:10.000] update_node: node martin-cn-01 reason set
> to: Kill task failed
> [2020-11-18T10:21:10.000] update_node: node martin-cn-01 state set to
> DRAINING
> [2020-11-18T10:21:10.001] error: slurmd error running JobId=110 on
> node(s)=martin-cn-01: Kill task failed
> [2020-11-18T10:21:10.124] cleanup_completing: JobId=110 completion
> process took 91 seconds
Compute node slurmd.log
> [2020-11-18T10:19:35.388] task_p_slurmd_batch_request: 110
> [2020-11-18T10:19:35.388] task/affinity: job 110 CPU input mask for
> node: 0x00000000000000000000000000000003
> [2020-11-18T10:19:35.388] task/affinity: job 110 CPU final HW mask for
> node: 0x00000000000000010000000000000001
> [2020-11-18T10:19:35.388] debug: [job 110] attempting to run prolog
> [/cm/local/apps/cmd/scripts/prolog]
> [2020-11-18T10:19:35.392] _run_prolog: run job script took usec=4085
> [2020-11-18T10:19:35.392] _run_prolog: prolog with lock for job 110
> ran for 0 seconds
> [2020-11-18T10:19:35.393] Launching batch job 110 for UID 0
> [2020-11-18T10:19:35.398] debug: AcctGatherEnergy NONE plugin loaded
> [2020-11-18T10:19:35.398] debug: AcctGatherProfile NONE plugin loaded
> [2020-11-18T10:19:35.398] debug: AcctGatherInterconnect NONE plugin
> loaded
> [2020-11-18T10:19:35.398] debug: AcctGatherFilesystem NONE plugin loaded
> [2020-11-18T10:19:35.398] debug: switch NONE plugin loaded
> [2020-11-18T10:19:35.398] debug: init: Gres GPU plugin loaded
> [2020-11-18T10:19:35.400] [110.batch] Considering each NUMA node as a
> socket
> [2020-11-18T10:19:35.400] [110.batch] debug: CPUs:128 Boards:1
> Sockets:8 CoresPerSocket:8 ThreadsPerCore:2
> [2020-11-18T10:19:35.400] [110.batch] debug: Job accounting gather
> cgroup plugin loaded
> [2020-11-18T10:19:35.401] [110.batch] debug: cont_id hasn't been set
> yet not running poll
> [2020-11-18T10:19:35.401] [110.batch] debug: laying out the 1 tasks
> on 1 hosts martin-cn-01 dist 2
> [2020-11-18T10:19:35.401] [110.batch] debug: Message thread started
> pid = 267029
> [2020-11-18T10:19:35.401] [110.batch] task affinity plugin loaded with
> CPU mask
> 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffffffffffffffffffff
> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
> constraining jobs allocated cores
> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup/memory:
> total:515440M allowed:100%(enforced), swap:0%(enforced),
> max:100%(515440M) max+swap:100%(1030880M) min:30M kmem:100%(515440M
> enforced) min:30M swappiness:100(set)
> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
> constraining jobs allocated memory
> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: unable to
> open
> /cm/shared/apps/slurm/var/etc/flamingo/cgroup_allowed_devices_file.conf:
> No such file or directory
> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
> constraining jobs allocated devices
> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: loaded
> [2020-11-18T10:19:35.402] [110.batch] debug: Checkpoint plugin
> loaded: checkpoint/none
> [2020-11-18T10:19:35.402] [110.batch] Munge credential signature
> plugin loaded
> [2020-11-18T10:19:35.402] [110.batch] debug: job_container none
> plugin loaded
> [2020-11-18T10:19:35.402] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/freezer/slurm' already exists
> [2020-11-18T10:19:35.402] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/freezer/slurm/uid_0' already exists
> [2020-11-18T10:19:35.403] [110.batch] debug: spank: opening plugin
> stack /etc/slurm/plugstack.conf
> [2020-11-18T10:19:35.403] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/cpuset/slurm' already exists
> [2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: job
> abstract cores are '0'
> [2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: step
> abstract cores are '0'
> [2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: job
> physical cores are '0,64'
> [2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: step
> physical cores are '0,64'
> [2020-11-18T10:19:35.403] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/cpuset/slurm/uid_0' already exists
> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm' already exists
> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm/uid_0' already exists
> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists
> [2020-11-18T10:19:35.406] [110.batch] task/cgroup:
> /slurm/uid_0/job_110: alloc=4096MB mem.limit=4096MB memsw.limit=4096MB
> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already
> exists
> [2020-11-18T10:19:35.406] [110.batch] task/cgroup:
> /slurm/uid_0/job_110/step_batch: alloc=4096MB mem.limit=4096MB
> memsw.limit=4096MB
> [2020-11-18T10:19:35.406] [110.batch] debug: _oom_event_monitor: started.
> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/devices/slurm' already exists
> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/devices/slurm/uid_0' already exists
> [2020-11-18T10:19:35.438] [110.batch] debug level = 2
> [2020-11-18T10:19:35.438] [110.batch] starting 1 tasks
> [2020-11-18T10:19:35.438] [110.batch] task 0 (267034) started
> 2020-11-18T10:19:35
> [2020-11-18T10:19:35.438] [110.batch] debug: Setting slurmstepd
> oom_adj to -1000
> [2020-11-18T10:19:35.438] [110.batch] debug:
> jobacct_gather_cgroup_cpuacct_attach_task: jobid 110 stepid 4294967294
> taskid 0 max_task_id 0
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/cpuacct/slurm' already exists
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0' already exists
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110' already exists
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110/step_batch' already
> exists
> [2020-11-18T10:19:35.439] [110.batch] debug:
> jobacct_gather_cgroup_memory_attach_task: jobid 110 stepid 4294967294
> taskid 0 max_task_id 0
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm' already exists
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm/uid_0' already exists
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already
> exists
> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch/task_0'
> already exists
> [2020-11-18T10:19:35.440] [110.batch] debug: task_p_pre_launch:
> affinity jobid 110.4294967294, task:0 bind:256
> [2020-11-18T10:19:35.440] [110.batch] task_p_pre_launch: Using
> sched_affinity for tasks
> [2020-11-18T10:19:39.770] debug: _rpc_terminate_job, uid = 450
> [2020-11-18T10:19:39.770] debug: task_p_slurmd_release_resources:
> affinity jobid 110
> [2020-11-18T10:19:39.770] debug: credential for job 110 revoked
> [2020-11-18T10:19:39.771] [110.batch] debug: Handling
> REQUEST_SIGNAL_CONTAINER
> [2020-11-18T10:19:39.771] [110.batch] debug: _handle_signal_container
> for step=110.4294967294 uid=450 signal=18
> [2020-11-18T10:19:39.771] [110.batch] Sent signal 18 to 110.4294967294
> [2020-11-18T10:19:39.771] [110.batch] debug: Handling
> REQUEST_SIGNAL_CONTAINER
> [2020-11-18T10:19:39.771] [110.batch] debug: _handle_signal_container
> for step=110.4294967294 uid=450 signal=15
> [2020-11-18T10:19:39.771] [110.batch] error: *** JOB 110 ON
> martin-cn-01 CANCELLED AT 2020-11-18T10:19:39 ***
> [2020-11-18T10:19:39.771] [110.batch] Sent signal 15 to 110.4294967294
> [2020-11-18T10:19:39.771] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:39.792] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:39.842] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:39.942] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:40.442] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:41.442] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:42.443] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:43.443] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:44.443] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:45.443] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:46.443] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:47.443] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:48.444] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:49.444] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:19:50.444] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:00.444] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:09.445] [110.batch] debug: Handling
> REQUEST_STEP_TERMINATE
> [2020-11-18T10:20:09.445] [110.batch] debug: _handle_terminate for
> step=110.4294967294 uid=0
> [2020-11-18T10:20:09.445] [110.batch] Sent SIGKILL signal to
> 110.4294967294
> [2020-11-18T10:20:09.445] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:09.465] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:09.515] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:09.616] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:10.116] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:11.116] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:11.116] [110.batch] debug: Handling
> REQUEST_STEP_TERMINATE
> [2020-11-18T10:20:11.116] [110.batch] debug: _handle_terminate for
> step=110.4294967294 uid=0
> [2020-11-18T10:20:11.116] [110.batch] Sent SIGKILL signal to
> 110.4294967294
> [2020-11-18T10:20:12.116] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:20:12.116] [110.batch] debug: Handling
> REQUEST_STEP_TERMINATE
> [2020-11-18T10:20:12.116] [110.batch] debug: _handle_terminate for
> step=110.4294967294 uid=0
The SIGKILL signals continue until the unkillable-step timeout
(UnkillableStepTimeout) is reached, at which point Slurm closes the job and
leaves the processes running on the node, with the following output (the
relevant slurm.conf settings are sketched after the log):
> [2020-11-18T10:20:59.282] debug: _rpc_terminate_job, uid = 450
> [2020-11-18T10:20:59.282] debug: task_p_slurmd_release_resources:
> affinity jobid 110
> [2020-11-18T10:21:00.121] [110.batch] debug: Handling REQUEST_STATE
> [2020-11-18T10:21:00.121] [110.batch] debug: Handling
> REQUEST_STEP_TERMINATE
> [2020-11-18T10:21:00.121] [110.batch] debug: _handle_terminate for
> step=110.4294967294 uid=0
> [2020-11-18T10:21:00.121] [110.batch] Sent SIGKILL signal to
> 110.4294967294
> [2020-11-18T10:21:10.000] [110.batch] error: *** JOB 110 STEPD
> TERMINATED ON martin-cn-01 AT 2020-11-18T10:21:09 DUE TO JOB NOT
> ENDING WITH SIGNALS ***
> [2020-11-18T10:21:10.000] [110.batch] debug: get_exit_code task 0
> killed by cmd
> [2020-11-18T10:21:10.000] [110.batch] job 110 completed with slurm_rc
> = 4001, job_rc = -1
> [2020-11-18T10:21:10.000] [110.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status:-1
> [2020-11-18T10:21:10.001] [110.batch] debug: Message thread exited
> [2020-11-18T10:21:10.001] [110.batch] done with job
> [2020-11-18T10:21:10.121] debug: Waiting for job 110's prolog to complete
> [2020-11-18T10:21:10.121] debug: Finished wait for job 110's prolog
> to complete
> [2020-11-18T10:21:10.121] debug: [job 110] attempting to run epilog
> [/cm/local/apps/cmd/scripts/epilog]
> [2020-11-18T10:21:10.124] debug: completed epilog for jobid 110
> [2020-11-18T10:21:10.124] debug: Job 110: sent epilog complete msg:
> rc = 0
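For completeness, the timeout-and-drain behavior in the log above is governed
by the unkillable-step handling in slurm.conf. A minimal sketch of the relevant
settings (the values shown are illustrative, not necessarily our production
config):

    # slurm.conf -- parameters behind the "Kill task failed" / drain path
    ProctrackType=proctrack/cgroup        # how slurmd tracks and signals job processes
    TaskPlugin=task/affinity,task/cgroup
    UnkillableStepTimeout=60              # seconds to wait after SIGKILL before
                                          # declaring "Kill task failed" and draining
    #UnkillableStepProgram=/path/to/hook  # optional script run when a step is unkillable

    # quick check of the live values on this cluster:
    scontrol show config | grep -Ei 'unkillable|proctrack|taskplugin'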
Willy Markuske
HPC Systems Engineer
Research Data Services
P: (858) 246-5593
On 11/18/20 9:41 AM, mercan wrote:
> Hi;
>
> Check the epilog return value, which comes from the return value of the
> last line of the epilog script. Also, for testing purposes, you can add an
> "exit 0" line as the last line of the epilog script to ensure it returns
> zero.
>
> Ahmet M.
>
>
> On 18.11.2020 20:00, William Markuske wrote:
>>
>> Hello,
>>
>> I am having an odd problem where users are unable to kill their jobs
>> with scancel. Users can submit jobs just fine, and when a task
>> completes it closes correctly. However, if a user attempts
>> to cancel a job via scancel, the SIGKILL signals are sent to the step
>> but never take effect. Slurmd then continues to send SIGKILL requests
>> until the UnkillableStepTimeout is hit, the Slurm job exits with an
>> error, the node enters a draining state, and the spawned processes
>> continue to run on the node.
>>
>> I'm at a loss because jobs can complete without issue, which seems to
>> suggest it's not a networking or permissions problem preventing Slurm
>> from performing its job accounting tasks. A user can ssh to the node once
>> a job is submitted and kill the subprocesses manually, at which point
>> Slurm completes the epilog and the node returns to idle.
>>
>> Does anyone know what may be causing such behavior? Please let me
>> know any slurm.conf or cgroup.conf settings that would be helpful to
>> diagnose this issue. I'm quite stumped by this one.
>>
>> --
>>
>> Willy Markuske
>>
>> HPC Systems Engineer
>>
>>
>>
>> Research Data Services
>>
>> P: (858) 246-5593
>>