[slurm-users] Slurm cannot kill a job which time limit exhausted

Taras Shapovalov taras.shapovalov at brightcomputing.com
Tue Mar 19 11:21:13 UTC 2019


Hey guys,

When a job's time limit is exceeded, Slurm tries to kill the job and fails:

[2019-03-15T09:44:03.589] sched: _slurm_rpc_allocate_resources JobId=1325 NodeList=rn003 usec=355
[2019-03-15T09:44:03.928] prolog_running_decr: Configuration for JobID=1325 is complete
[2019-03-15T09:45:12.739] Time limit exhausted for JobId=1325
[2019-03-15T09:45:44.001] _slurm_rpc_complete_job_allocation: JobID=1325 State=0x8006 NodeCnt=1 error Job/step already completing or completed
[2019-03-15T09:46:12.805] Resending TERMINATE_JOB request JobId=1325 Nodelist=rn003
[2019-03-15T09:48:43.000] update_node: node rn003 reason set to: Kill task failed
[2019-03-15T09:48:43.000] update_node: node rn003 state set to DRAINING
[2019-03-15T09:48:43.000] got (nil)
[2019-03-15T09:48:43.816] cleanup_completing: job 1325 completion process took 211 seconds
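
My assumption is that the "Kill task failed" drain reason comes from the node not being able to terminate the step within UnkillableStepTimeout. A quick way to check those settings from the controller (assuming scontrol is in PATH):

    scontrol show config | grep -i unkillable

and, once the job has finally completed, the drained node can be returned to service with:

    scontrol update NodeName=rn003 State=RESUME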

This happens even on very simple "srun bash" jobs that exceed their time
limit. Do you have any idea what this means? Upgrading to the latest version did not help.
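
For what it's worth, a minimal example along these lines reproduces it (the one-minute limit and the sleep duration are just placeholders for the real jobs):

    srun --time=00:01:00 sleep 300

Once the limit is hit, slurmctld logs the repeated TERMINATE_JOB requests and eventually drains the node as shown above.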


Best regards,

Taras