[slurm-users] Users can't scancel
mercan
ahmet.mercan at uhem.itu.edu.tr
Wed Nov 18 19:18:49 UTC 2020
These log lines about the prolog script look very suspicious to me:
[2020-11-18T10:19:35.388] debug: [job 110] attempting to run prolog
[/cm/local/apps/cmd/scripts/prolog]
then
[2020-11-18T10:21:10.121] debug: Waiting for job 110's prolog to complete
[2020-11-18T10:21:10.121] debug: Finished wait for job 110's prolog to
complete
Two minutes later, the logs are still complaining about waiting for the
prolog script to complete. Would you check the prolog script?
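If you want to rule the prolog out, a quick manual test on the compute
node is to run the script by hand and check both its runtime and its
exit code. This is only a sketch; the environment variables your prolog
actually reads may differ:

    # on the compute node, as root
    export SLURM_JOB_ID=110       # any job id works for a dry run
    export SLURM_JOB_UID=0        # uid the prolog would normally see
    time bash -x /cm/local/apps/cmd/scripts/prolog
    echo "prolog exit code: $?"

A non-zero exit code or a long runtime here would explain why slurmd
appears to hang around the prolog/epilog phase.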
Ahmet M.
On 18.11.2020 at 21:33, William Markuske wrote:
>
> The epilog script does have exit 0 set at the end. Epilogs exit
> cleanly when run.
>
> With the log level set to debug5, I get the following results for any
> scancel call.
>
> Submit host slurmctld.log
>
>> [2020-11-18T10:19:34.944] _slurm_rpc_submit_batch_job: JobId=110
>> InitPrio=110503 usec=191
>> [2020-11-18T10:19:35.231] debug: sched: Running job scheduler
>> [2020-11-18T10:19:35.231] sched: Allocate JobId=110
>> NodeList=martin-cn-01 #CPUs=1 Partition=martin-compute
>> [2020-11-18T10:19:35.387] prolog_running_decr: Configuration for
>> JobId=110 is complete
>> [2020-11-18T10:19:39.769] _slurm_rpc_kill_job: REQUEST_KILL_JOB
>> JobId=110 uid 0
>> [2020-11-18T10:19:39.770] job_signal: 9 of running JobId=110
>> successful 0x8004
>> [2020-11-18T10:20:59.282] Resending TERMINATE_JOB request JobId=110
>> Nodelist=martin-cn-01
>> [2020-11-18T10:21:10.000] update_node: node martin-cn-01 reason set
>> to: Kill task failed
>> [2020-11-18T10:21:10.000] update_node: node martin-cn-01 state set to
>> DRAINING
>> [2020-11-18T10:21:10.001] error: slurmd error running JobId=110 on
>> node(s)=martin-cn-01: Kill task failed
>> [2020-11-18T10:21:10.124] cleanup_completing: JobId=110 completion
>> process took 91 seconds
> Compute node slurmd.log
>
>> [2020-11-18T10:19:35.388] task_p_slurmd_batch_request: 110
>> [2020-11-18T10:19:35.388] task/affinity: job 110 CPU input mask for
>> node: 0x00000000000000000000000000000003
>> [2020-11-18T10:19:35.388] task/affinity: job 110 CPU final HW mask
>> for node: 0x00000000000000010000000000000001
>> [2020-11-18T10:19:35.388] debug: [job 110] attempting to run prolog
>> [/cm/local/apps/cmd/scripts/prolog]
>> [2020-11-18T10:19:35.392] _run_prolog: run job script took usec=4085
>> [2020-11-18T10:19:35.392] _run_prolog: prolog with lock for job 110
>> ran for 0 seconds
>> [2020-11-18T10:19:35.393] Launching batch job 110 for UID 0
>> [2020-11-18T10:19:35.398] debug: AcctGatherEnergy NONE plugin loaded
>> [2020-11-18T10:19:35.398] debug: AcctGatherProfile NONE plugin loaded
>> [2020-11-18T10:19:35.398] debug: AcctGatherInterconnect NONE plugin
>> loaded
>> [2020-11-18T10:19:35.398] debug: AcctGatherFilesystem NONE plugin loaded
>> [2020-11-18T10:19:35.398] debug: switch NONE plugin loaded
>> [2020-11-18T10:19:35.398] debug: init: Gres GPU plugin loaded
>> [2020-11-18T10:19:35.400] [110.batch] Considering each NUMA node as a
>> socket
>> [2020-11-18T10:19:35.400] [110.batch] debug: CPUs:128 Boards:1
>> Sockets:8 CoresPerSocket:8 ThreadsPerCore:2
>> [2020-11-18T10:19:35.400] [110.batch] debug: Job accounting gather
>> cgroup plugin loaded
>> [2020-11-18T10:19:35.401] [110.batch] debug: cont_id hasn't been set
>> yet not running poll
>> [2020-11-18T10:19:35.401] [110.batch] debug: laying out the 1 tasks
>> on 1 hosts martin-cn-01 dist 2
>> [2020-11-18T10:19:35.401] [110.batch] debug: Message thread started
>> pid = 267029
>> [2020-11-18T10:19:35.401] [110.batch] task affinity plugin loaded
>> with CPU mask
>> 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffffffffffffffffffff
>> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
>> constraining jobs allocated cores
>> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup/memory:
>> total:515440M allowed:100%(enforced), swap:0%(enforced),
>> max:100%(515440M) max+swap:100%(1030880M) min:30M kmem:100%(515440M
>> enforced) min:30M swappiness:100(set)
>> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
>> constraining jobs allocated memory
>> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: unable to
>> open
>> /cm/shared/apps/slurm/var/etc/flamingo/cgroup_allowed_devices_file.conf:
>> No such file or directory
>> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: now
>> constraining jobs allocated devices
>> [2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup: loaded
>> [2020-11-18T10:19:35.402] [110.batch] debug: Checkpoint plugin
>> loaded: checkpoint/none
>> [2020-11-18T10:19:35.402] [110.batch] Munge credential signature
>> plugin loaded
>> [2020-11-18T10:19:35.402] [110.batch] debug: job_container none
>> plugin loaded
>> [2020-11-18T10:19:35.402] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/freezer/slurm' already exists
>> [2020-11-18T10:19:35.402] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/freezer/slurm/uid_0' already exists
>> [2020-11-18T10:19:35.403] [110.batch] debug: spank: opening plugin
>> stack /etc/slurm/plugstack.conf
>> [2020-11-18T10:19:35.403] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/cpuset/slurm' already exists
>> [2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: job
>> abstract cores are '0'
>> [2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: step
>> abstract cores are '0'
>> [2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: job
>> physical cores are '0,64'
>> [2020-11-18T10:19:35.403] [110.batch] debug: task/cgroup: step
>> physical cores are '0,64'
>> [2020-11-18T10:19:35.403] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/cpuset/slurm/uid_0' already exists
>> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm' already exists
>> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm/uid_0' already exists
>> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists
>> [2020-11-18T10:19:35.406] [110.batch] task/cgroup:
>> /slurm/uid_0/job_110: alloc=4096MB mem.limit=4096MB memsw.limit=4096MB
>> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already
>> exists
>> [2020-11-18T10:19:35.406] [110.batch] task/cgroup:
>> /slurm/uid_0/job_110/step_batch: alloc=4096MB mem.limit=4096MB
>> memsw.limit=4096MB
>> [2020-11-18T10:19:35.406] [110.batch] debug: _oom_event_monitor: started.
>> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/devices/slurm' already exists
>> [2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/devices/slurm/uid_0' already exists
>> [2020-11-18T10:19:35.438] [110.batch] debug level = 2
>> [2020-11-18T10:19:35.438] [110.batch] starting 1 tasks
>> [2020-11-18T10:19:35.438] [110.batch] task 0 (267034) started
>> 2020-11-18T10:19:35
>> [2020-11-18T10:19:35.438] [110.batch] debug: Setting slurmstepd
>> oom_adj to -1000
>> [2020-11-18T10:19:35.438] [110.batch] debug:
>> jobacct_gather_cgroup_cpuacct_attach_task: jobid 110 stepid
>> 4294967294 taskid 0 max_task_id 0
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/cpuacct/slurm' already exists
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0' already exists
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110' already exists
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110/step_batch'
>> already exists
>> [2020-11-18T10:19:35.439] [110.batch] debug:
>> jobacct_gather_cgroup_memory_attach_task: jobid 110 stepid 4294967294
>> taskid 0 max_task_id 0
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm' already exists
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm/uid_0' already exists
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already
>> exists
>> [2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch/task_0'
>> already exists
>> [2020-11-18T10:19:35.440] [110.batch] debug: task_p_pre_launch:
>> affinity jobid 110.4294967294, task:0 bind:256
>> [2020-11-18T10:19:35.440] [110.batch] task_p_pre_launch: Using
>> sched_affinity for tasks
>> [2020-11-18T10:19:39.770] debug: _rpc_terminate_job, uid = 450
>> [2020-11-18T10:19:39.770] debug: task_p_slurmd_release_resources:
>> affinity jobid 110
>> [2020-11-18T10:19:39.770] debug: credential for job 110 revoked
>> [2020-11-18T10:19:39.771] [110.batch] debug: Handling
>> REQUEST_SIGNAL_CONTAINER
>> [2020-11-18T10:19:39.771] [110.batch] debug: _handle_signal_container
>> for step=110.4294967294 uid=450 signal=18
>> [2020-11-18T10:19:39.771] [110.batch] Sent signal 18 to 110.4294967294
>> [2020-11-18T10:19:39.771] [110.batch] debug: Handling
>> REQUEST_SIGNAL_CONTAINER
>> [2020-11-18T10:19:39.771] [110.batch] debug: _handle_signal_container
>> for step=110.4294967294 uid=450 signal=15
>> [2020-11-18T10:19:39.771] [110.batch] error: *** JOB 110 ON
>> martin-cn-01 CANCELLED AT 2020-11-18T10:19:39 ***
>> [2020-11-18T10:19:39.771] [110.batch] Sent signal 15 to 110.4294967294
>> [2020-11-18T10:19:39.771] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:39.792] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:39.842] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:39.942] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:40.442] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:41.442] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:42.443] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:43.443] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:44.443] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:45.443] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:46.443] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:47.443] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:48.444] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:49.444] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:19:50.444] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:00.444] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:09.445] [110.batch] debug: Handling
>> REQUEST_STEP_TERMINATE
>> [2020-11-18T10:20:09.445] [110.batch] debug: _handle_terminate for
>> step=110.4294967294 uid=0
>> [2020-11-18T10:20:09.445] [110.batch] Sent SIGKILL signal to
>> 110.4294967294
>> [2020-11-18T10:20:09.445] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:09.465] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:09.515] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:09.616] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:10.116] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:11.116] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:11.116] [110.batch] debug: Handling
>> REQUEST_STEP_TERMINATE
>> [2020-11-18T10:20:11.116] [110.batch] debug: _handle_terminate for
>> step=110.4294967294 uid=0
>> [2020-11-18T10:20:11.116] [110.batch] Sent SIGKILL signal to
>> 110.4294967294
>> [2020-11-18T10:20:12.116] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:20:12.116] [110.batch] debug: Handling
>> REQUEST_STEP_TERMINATE
>> [2020-11-18T10:20:12.116] [110.batch] debug: _handle_terminate for
>> step=110.4294967294 uid=0
> The SIGKILL signals continue until the unkillable step timeout is
> reached, and then Slurm closes the job, leaving the processes still
> running on the node, with the following output:
>
>> [2020-11-18T10:20:59.282] debug: _rpc_terminate_job, uid = 450
>> [2020-11-18T10:20:59.282] debug: task_p_slurmd_release_resources:
>> affinity jobid 110
>> [2020-11-18T10:21:00.121] [110.batch] debug: Handling REQUEST_STATE
>> [2020-11-18T10:21:00.121] [110.batch] debug: Handling
>> REQUEST_STEP_TERMINATE
>> [2020-11-18T10:21:00.121] [110.batch] debug: _handle_terminate for
>> step=110.4294967294 uid=0
>> [2020-11-18T10:21:00.121] [110.batch] Sent SIGKILL signal to
>> 110.4294967294
>> [2020-11-18T10:21:10.000] [110.batch] error: *** JOB 110 STEPD
>> TERMINATED ON martin-cn-01 AT 2020-11-18T10:21:09 DUE TO JOB NOT
>> ENDING WITH SIGNALS ***
>> [2020-11-18T10:21:10.000] [110.batch] debug: get_exit_code task 0
>> killed by cmd
>> [2020-11-18T10:21:10.000] [110.batch] job 110 completed with slurm_rc
>> = 4001, job_rc = -1
>> [2020-11-18T10:21:10.000] [110.batch] sending
>> REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status:-1
>> [2020-11-18T10:21:10.001] [110.batch] debug: Message thread exited
>> [2020-11-18T10:21:10.001] [110.batch] done with job
>> [2020-11-18T10:21:10.121] debug: Waiting for job 110's prolog to
>> complete
>> [2020-11-18T10:21:10.121] debug: Finished wait for job 110's prolog
>> to complete
>> [2020-11-18T10:21:10.121] debug: [job 110] attempting to run epilog
>> [/cm/local/apps/cmd/scripts/epilog]
>> [2020-11-18T10:21:10.124] debug: completed epilog for jobid 110
>> [2020-11-18T10:21:10.124] debug: Job 110: sent epilog complete msg:
>> rc = 0
>
> Willy Markuske
>
> HPC Systems Engineer
>
>
>
> Research Data Services
>
> P: (858) 246-5593
>
>
> On 11/18/20 9:41 AM, mercan wrote:
>> Hi;
>>
>> Check the epilog return value, which comes from the return value of
>> the last command in the epilog script. Also, you can add an "exit 0"
>> line as the last line of the epilog script to ensure a zero return
>> value for testing purposes.
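>>
>> A minimal epilog skeleton for such a test could look like the
>> following (only a sketch; keep your site-specific cleanup in place):
>>
>>     #!/bin/bash
>>     # ... existing site cleanup commands ...
>>     exit 0   # force a zero return code while testing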
>>
>> Ahmet M.
>>
>>
>> On 18.11.2020 at 20:00, William Markuske wrote:
>>>
>>> Hello,
>>>
>>> I am having an odd problem where users are unable to kill their jobs
>>> with scancel. Users can submit jobs just fine, and when a task
>>> completes on its own the job closes correctly. However, if a user
>>> attempts to cancel a job via scancel, the SIGKILL signals are sent to
>>> the step but never take effect. Slurmd then continues to send SIGKILL
>>> requests until the UnkillableTimeout is hit, the Slurm job exits with
>>> an error, the node enters a draining state, and the spawned processes
>>> continue to run on the node.
>>>
>>> I'm at a loss because jobs can complete without issue, which seems
>>> to suggest it's not a networking or permissions problem preventing
>>> Slurm from doing its job accounting tasks. A user can ssh to the node
>>> once a job is submitted and kill the subprocesses manually, at which
>>> point Slurm completes the epilog and the node returns to idle.
>>>
>>> Does anyone know what may be causing such behavior? Please let me
>>> know any slurm.conf or cgroup.conf settings that would be helpful to
>>> diagnose this issue. I'm quite stumped by this one.
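>>>
>>> If it helps, something like the following should dump the settings
>>> that look most relevant in one go (just a sketch; the grep pattern
>>> is only a guess at what matters, and the cgroup.conf path may differ
>>> on your system):
>>>
>>>     scontrol show config | grep -Ei \
>>>         'UnkillableStep|ProctrackType|TaskPlugin|^Prolog|^Epilog'
>>>     cat /etc/slurm/cgroup.conf   # adjust to your config directory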
>>>
>>> --
>>>
>>> Willy Markuske
>>>
>>> HPC Systems Engineer
>>>
>>>
>>>
>>> Research Data Services
>>>
>>> P: (858) 246-5593
>>>