[slurm-users] Users can't scancel
mercan
ahmet.mercan at uhem.itu.edu.tr
Wed Nov 18 19:18:49 UTC 2020
These log lines about the prolog script look very suspicious to me:
[2020-11-18T10:19:35.388] debug: [job 110] attempting to run prolog
[/cm/local/apps/cmd/scripts/prolog]
then
[2020-11-18T10:21:10.121] debug: Waiting for job 110's prolog to complete
[2020-11-18T10:21:10.121] debug: Finished wait for job 110's prolog to
complete
Two minutes later, the logs are still complaining about waiting for the
prolog script to complete. Would you check the prolog script?
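If you want to rule the prolog out, a quick manual test on the compute
node is to run the script by hand and check both its runtime and its
exit code. This is only a sketch; the environment variables your prolog
actually reads may differ:

    # on the compute node, as root
    export SLURM_JOB_ID=110       # any job id works for a dry run
    export SLURM_JOB_UID=0        # uid the prolog would normally see
    time bash -x /cm/local/apps/cmd/scripts/prolog
    echo "prolog exit code: $?"

A non-zero exit code or a long runtime here would explain why slurmd
appears to hang around the prolog/epilog phase.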
Ahmet M.
On 18.11.2020 at 21:33, William Markuske wrote:
>
> The epilog script does have exit 0 set at the end. Epilogs exit
> cleanly when run.
>
> With the log level set to debug5, I get the following results for any
> scancel call.
>
> Submit host slurmctld.log
>
>> [2020-11-18T10:19:34.944] _slurm_rpc_submit_batch_job: JobId=110
>> InitPrio=110503 usec=191
>> [2020-11-18T10:19:35.231] debug: sched: Running job scheduler
>> [2020-11-18T10:19:35.231] sched: Allocate JobId=110
>> NodeList=martin-cn-01 #CPUs=1 Partition=martin-compute
>> [2020-11-18T10:19:35.387] prolog_running_decr: Configuration for
>> JobId=110 is complete
>> [2020-11-18T10:19:39.769] _slurm_rpc_kill_job: REQUEST_KILL_JOB
>> JobId=110 uid 0
>> [2020-11-18T10:19:39.770] job_signal: 9 of running JobId=110
>> successful 0x8004
>> [2020-11-18T10:20:59.282] Resending TERMINATE_JOB request JobId=110
>> Nodelist=martin-cn-01
>> [2020-11-18T10:21:10.000] update_node: node martin-cn-01 reason set
>> to: Kill task failed
>> [2020-11-18T10:21:10.000] update_node: node martin-cn-01 state set to
>> DRAINING
>> [2020-11-18T10:21:10.001] error: slurmd error running JobId=110 on
>> node(s)=martin-cn-01: Kill task failed
>> [2020-11-18T10:21:10.124] cleanup_completing: JobId=110 completion
>> process took 91 seconds
> Compute node slurmd.log
>
>> [2020-11-18T10:19:35.388] task_p_slurmd_batch_request: 110
>> [2020-11-18T10:19:35.388] task/affinity: job 110 CPU input mask for
>> node: 0x00000000000000000000000000000003
>> [2020-11-18T10:19:35.388] task/affinity: job 110 CPU final HW mask
>> for node: 0x00000000000000010000000000000001
>> [2020-11-18T10:19:35.388] debug: [job 110] attempting to run prolog
>> [/cm/local/apps/cmd/scripts/prolog]
>> [2020-11-18T10:19:35.392] _run_prolog: run job script took usec=4085
>> [2020-11-18T10:19:35.392] _run_prolog: prolog with lock for job 110
>> ran for 0 seconds
>> [2020-11-18T10:19:35.393] Launching batch job 110 for UID 0
>> [2020-11-18T10:19:35.398] debug: AcctGatherEnergy NONE plugin loaded
>> [2020-11-18T10:19:35.398] debug: AcctGatherProfile NONE plugin loaded
>> [2020-11-18T10:19:35.398] debug: AcctGatherInterconnect NONE plugin
>> loaded
>> [2020-11-18T10:19:35.398] debug: AcctGatherFilesystem NONE plugin loaded
>> [2020-11-18T10:19:35.398] debug: switch NONE plugin loaded
>> [2020-11-18T10:19:35.398] debug: init: Gres GPU plugin loaded
>> [2020-11-18T10:19:35.400] [110.batch] Considering each NUMA node as a
>> socket
>> [2020-11-18T10:19:35.400] [110.batch] debug: CPUs:128 Boards:1
>> Sockets:8 CoresPerSocket:8 ThreadsPerCore:2
>> [2020-11-18T10:19:35.400] [110.batch] debug: Job accounting gather
>> cgroup plugin loaded
>> [2020-11-18T10:19:35.401] [110.batch] debug: cont_id hasn't been set
>> yet not running poll
>> [2020-11-18T10:19:35.401] [110.batch] debug: laying out the 1 tasks
>> on 1 hosts martin-cn-01 dist 2
>> [2020-11-18T10:19:35.401] [110.batch] debug: Message thread started
>> pid = 267029
>> [2020-11-18T10:19:35.401] [110.batch] task affinity plugin loaded
>> with CPU mask
>> 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffffffffffffffffffff
>> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
>> constraining jobs allocated cores
>> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup/memory:
>> total:515440M allowed:100%(enforced), swap:0%(enforced),
>> max:100%(515440M) max+swap:100%(1030880M) min:30M kmem:100%(515440M
>> enforced) min:30M swappiness:100(set)
>> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
>> constraining jobs allocated memory
>> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: unable to
>> open
>> /cm/shared/apps/slurm/var/etc/flamingo/cgroup_allowed_devices_file.conf:
>> No such file or directory
>> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
>> constraining jobs allocated devices
>> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: loaded
>> [2020-11-18T10:19:35.402] [110.batch] debug: Checkpoint plugin
>> loaded: checkpoint/none
>> [2020-11-18T10:19:35.402] [110.batch] Munge credential signature
>> plugin loaded
>> [2020-11-18T10:19:35.402] [110.batch] debug: job_container none
>> plugin loaded
>> [2020-11-18T10:19:35.402] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/freezer/slurm' already exists
>> [2020-11-18T10:19:35.402] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/freezer/slurm/uid_0' already exists
>> [2020-11-18T10:19:35.403] [110.batch] debug: spank: opening plugin
>> stack /etc/slurm/plugstack.conf
>> [2020-11-18T10:19:35.403] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/cpuset/slurm' already exists
>> [2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: job
>> abstract cores are '0'
>> [2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: step
>> abstract cores are '0'
>> [2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: job
>> physical cores are '0,64'
>> [2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: step
>> physical cores are '0,64'
>> [2020-11-18T10:19:35.403] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/cpuset/slurm/uid_0' already exists
>> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm' already exists
>> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm/uid_0' already exists
>> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists
>> [2020-11-18T10:19:35.406] [110.batch] task/cgroup:
>> /slurm/uid_0/job_110: alloc=4096MB mem.limit=4096MB memsw.limit=4096MB
>> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already
>> exists
>> [2020-11-18T10:19:35.406] [110.batch] task/cgroup:
>> /slurm/uid_0/job_110/step_batch: alloc=4096MB mem.limit=4096MB
>> memsw.limit=4096MB
>> [2020-11-18T10:19:35.406] [110.batch] debug: _oom_event_monitor: started.
>> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/devices/slurm' already exists
>> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/devices/slurm/uid_0' already exists
>> [2020-11-18T10:19:35.438] [110.batch] debug level = 2
>> [2020-11-18T10:19:35.438] [110.batch] starting 1 tasks
>> [2020-11-18T10:19:35.438] [110.batch] task 0 (267034) started
>> 2020-11-18T10:19:35
>> [2020-11-18T10:19:35.438] [110.batch] debug: Setting slurmstepd
>> oom_adj to -1000
>> [2020-11-18T10:19:35.438] [110.batch] debug:
>> jobacct_gather_cgroup_cpuacct_attach_task: jobid 110 stepid
>> 4294967294 taskid 0 max_task_id 0
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/cpuacct/slurm' already exists
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0' already exists
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110' already exists
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110/step_batch'
>> already exists
>> [2020-11-18T10:19:35.439] [110.batch] debug:
>> jobacct_gather_cgroup_memory_attach_task: jobid 110 stepid 4294967294
>> taskid 0 max_task_id 0
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm' already exists
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm/uid_0' already exists
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already
>> exists
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch/task_0'
>> already exists
>> [2020-11-18T10:19:35.440] [110.batch] debug: task_p_pre_launch:
>> affinity jobid 110.4294967294, task:0 bind:256
>> [2020-11-18T10:19:35.440] [110.batch] task_p_pre_launch: Using
>> sched_affinity for tasks
>> [2020-11-18T10:19:39.770] debug: _rpc_terminate_job, uid = 450
>> [2020-11-18T10:19:39.770] debug: task_p_slurmd_release_resources:
>> affinity jobid 110
>> [2020-11-18T10:19:39.770] debug: credential for job 110 revoked
>> [2020-11-18T10:19:39.771] [110.batch] debug: Handling
>> REQUEST_SIGNAL_CONTAINER
>> [2020-11-18T10:19:39.771] [110.batch] debug: _handle_signal_container
>> for step=110.4294967294 uid=450 signal=18
>> [2020-11-18T10:19:39.771] [110.batch] Sent signal 18 to 110.4294967294
>> [2020-11-18T10:19:39.771] [110.batch] debug: Handling
>> REQUEST_SIGNAL_CONTAINER
>> [2020-11-18T10:19:39.771] [110.batch] debug: _handle_signal_container
>> for step=110.4294967294 uid=450 signal=15
>> [2020-11-18T10:19:39.771] [110.batch] error: *** JOB 110 ON
>> martin-cn-01 CANCELLED AT 2020-11-18T10:19:39 ***
>> [2020-11-18T10:19:39.771] [110.batch] Sent signal 15 to 110.4294967294
>> [2020-11-18T10:19:39.771] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:39.792] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:39.842] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:39.942] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:40.442] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:41.442] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:42.443] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:43.443] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:44.443] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:45.443] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:46.443] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:47.443] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:48.444] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:49.444] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:50.444] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:00.444] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:09.445] [110.batch] debug: Handling
>> REQUEST_STEP_TERMINATE
>> [2020-11-18T10:20:09.445] [110.batch] debug: _handle_terminate for
>> step=110.4294967294 uid=0
>> [2020-11-18T10:20:09.445] [110.batch] Sent SIGKILL signal to
>> 110.4294967294
>> [2020-11-18T10:20:09.445] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:09.465] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:09.515] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:09.616] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:10.116] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:11.116] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:11.116] [110.batch] debug: Handling
>> REQUEST_STEP_TERMINATE
>> [2020-11-18T10:20:11.116] [110.batch] debug: _handle_terminate for
>> step=110.4294967294 uid=0
>> [2020-11-18T10:20:11.116] [110.batch] Sent SIGKILL signal to
>> 110.4294967294
>> [2020-11-18T10:20:12.116] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:12.116] [110.batch] debug: Handling
>> REQUEST_STEP_TERMINATE
>> [2020-11-18T10:20:12.116] [110.batch] debug: _handle_terminate for
>> step=110.4294967294 uid=0
> The SIGKILL signals continue until the unkillable step timeout is
> reached, and then Slurm closes the job, leaving the processes still
> running on the node, with the following output:
>
>> [2020-11-18T10:20:59.282] debug: _rpc_terminate_job, uid = 450
>> [2020-11-18T10:20:59.282] debug: task_p_slurmd_release_resources:
>> affinity jobid 110
>> [2020-11-18T10:21:00.121] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:21:00.121] [110.batch] debug: Handling
>> REQUEST_STEP_TERMINATE
>> [2020-11-18T10:21:00.121] [110.batch] debug: _handle_terminate for
>> step=110.4294967294 uid=0
>> [2020-11-18T10:21:00.121] [110.batch] Sent SIGKILL signal to
>> 110.4294967294
>> [2020-11-18T10:21:10.000] [110.batch] error: *** JOB 110 STEPD
>> TERMINATED ON martin-cn-01 AT 2020-11-18T10:21:09 DUE TO JOB NOT
>> ENDING WITH SIGNALS ***
>> [2020-11-18T10:21:10.000] [110.batch] debug: get_exit_code task 0
>> killed by cmd
>> [2020-11-18T10:21:10.000] [110.batch] job 110 completed with slurm_rc
>> = 4001, job_rc = -1
>> [2020-11-18T10:21:10.000] [110.batch] sending
>> REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status:-1
>> [2020-11-18T10:21:10.001] [110.batch] debug: Message thread exited
>> [2020-11-18T10:21:10.001] [110.batch] done with job
>> [2020-11-18T10:21:10.121] debug: Waiting for job 110's prolog to
>> complete
>> [2020-11-18T10:21:10.121] debug: Finished wait for job 110's prolog
>> to complete
>> [2020-11-18T10:21:10.121] debug: [job 110] attempting to run epilog
>> [/cm/local/apps/cmd/scripts/epilog]
>> [2020-11-18T10:21:10.124] debug: completed epilog for jobid 110
>> [2020-11-18T10:21:10.124] debug: Job 110: sent epilog complete msg:
>> rc = 0
>
> Willy Markuske
>
> HPC Systems Engineer
>
>
>
> Research Data Services
>
> P: (858) 246-5593
>
>
> On 11/18/20 9:41 AM, mercan wrote:
>> Hi;
>>
>> Check the epilog return value, which comes from the return value of
>> the last command in the epilog script. Also, you can add an "exit 0"
>> line as the last line of the epilog script to ensure a zero return
>> value for testing purposes.
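>>
>> A minimal epilog skeleton for such a test could look like the
>> following (only a sketch; keep your site-specific cleanup in place):
>>
>>     #!/bin/bash
>>     # ... existing site cleanup commands ...
>>     exit 0   # force a zero return code while testing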
>>
>> Ahmet M.
>>
>>
>> On 18.11.2020 at 20:00, William Markuske wrote:
>>>
>>> Hello,
>>>
>>> I am having an odd problem where users are unable to kill their jobs
>>> with scancel. Users can submit jobs just fine, and when a task
>>> completes on its own the job closes correctly. However, if a user
>>> attempts to cancel a job via scancel, the SIGKILL signals are sent to
>>> the step but never take effect. Slurmd then continues to send SIGKILL
>>> requests until the UnkillableTimeout is hit, the Slurm job exits with
>>> an error, the node enters a draining state, and the spawned processes
>>> continue to run on the node.
>>>
>>> I'm at a loss because jobs can complete without issue, which seems
>>> to suggest it's not a networking or permissions problem preventing
>>> Slurm from doing its job accounting tasks. A user can ssh to the node
>>> once a job is submitted and kill the subprocesses manually, at which
>>> point Slurm completes the epilog and the node returns to idle.
>>>
>>> Does anyone know what may be causing such behavior? Please let me
>>> know any slurm.conf or cgroup.conf settings that would be helpful to
>>> diagnose this issue. I'm quite stumped by this one.
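>>>
>>> If it helps, something like the following should dump the settings
>>> that look most relevant in one go (just a sketch; the grep pattern
>>> is only a guess at what matters, and the cgroup.conf path may differ
>>> on your system):
>>>
>>>     scontrol show config | grep -Ei \
>>>         'UnkillableStep|ProctrackType|TaskPlugin|^Prolog|^Epilog'
>>>     cat /etc/slurm/cgroup.conf   # adjust to your config directory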
>>>
>>> --
>>>
>>> Willy Markuske
>>>
>>> HPC Systems Engineer
>>>
>>>
>>>
>>> Research Data Services
>>>
>>> P: (858) 246-5593
>>>