[slurm-users] Slurm cannot kill a job which time limit exhausted

Prentice Bisbal pbisbal at pppl.gov
Tue Mar 19 16:59:25 UTC 2019

Slurm is trying to kill the job that is exceeding its time limit, but 
the job doesn't die, so Slurm marks the node down because it sees this 
as a problem with the node. Increasing the value for GraceTime or 
KillWait might help:

> *GraceTime*
>     Specifies, in units of seconds, the preemption grace time to be
>     extended to a job which has been selected for preemption. The
>     default value is zero, no preemption grace time is allowed on this
>     partition. Once a job has been selected for preemption, its end
>     time is set to the current time plus GraceTime. The job's tasks
>     are immediately sent SIGCONT and SIGTERM signals in order to
>     provide notification of its imminent termination. This is followed
>     by the SIGCONT, SIGTERM and SIGKILL signal sequence upon reaching
>     its new end time. This second set of signals is sent to both the
>     tasks *and* the containing batch script, if applicable. Meaningful
>     only for PreemptMode=CANCEL. See also the global *KillWait*
>     configuration parameter. 

> *KillWait*
>     The interval, in seconds, given to a job's processes between the
>     SIGTERM and SIGKILL signals upon reaching its time limit. If the
>     job fails to terminate gracefully in the interval specified, it
>     will be forcibly terminated. The default value is 30 seconds. The
>     value may not exceed 65533. 
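
As a sketch, these could be set in slurm.conf roughly as follows (the partition name, node list, and numeric values below are illustrative assumptions, not recommendations from this thread):

```
# slurm.conf (illustrative fragment)

# Global: seconds between SIGTERM and SIGKILL when a job hits its time
# limit. Raising it from the 30-second default gives slow-to-exit
# processes more time to clean up before Slurm forces the kill.
KillWait=120

# Per-partition: preemption grace time in seconds (only meaningful for
# PreemptMode=CANCEL). Partition and node names here are hypothetical.
PartitionName=debug Nodes=rn[001-004] GraceTime=60 Default=YES
```

After editing slurm.conf, the new values take effect once the configuration is re-read (e.g. via `scontrol reconfigure`).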


On 3/19/19 7:21 AM, Taras Shapovalov wrote:
> Hey guys,
> When a job max time is exceeded, then Slurm tries to kill the job and 
> fails:
> [2019-03-15T09:44:03.589] sched: _slurm_rpc_allocate_resources 
> JobId=1325 NodeList=rn003 usec=355
> [2019-03-15T09:44:03.928] prolog_running_decr: Configuration for 
> JobID=1325 is complete
> [2019-03-15T09:45:12.739] Time limit exhausted for JobId=1325
> [2019-03-15T09:45:44.001] _slurm_rpc_complete_job_allocation: 
> JobID=1325 State=0x8006 NodeCnt=1 error Job/step already completing or 
> completed
> [2019-03-15T09:46:12.805] Resending TERMINATE_JOB request JobId=1325 
> Nodelist=rn003
> [2019-03-15T09:48:43.000] update_node: node rn003 reason set to: Kill 
> task failed
> [2019-03-15T09:48:43.000] update_node: node rn003 state set to DRAINING
> [2019-03-15T09:48:43.000] got (nil)
> [2019-03-15T09:48:43.816] cleanup_completing: job 1325 completion 
> process took 211 seconds
> This happens even on very simple "srun bash" jobs that exceed their 
> time limits. Do you have any idea what this means? Upgrading to the 
> latest version did not help.
> Best regards,
> Taras

