[slurm-users] Nodes remaining in drain state once job completes

Eric Rosenberg Eric.Rosenberg at stonybrook.edu
Mon Mar 18 19:40:26 UTC 2019


Hello,
I've set up a few nodes with Slurm to test and am having trouble. It
seems that once a job has met its wall time, the node it ran on
enters the completing (comp) state and then remains in the drain state
until I manually set the state to resume.
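For reference, the manual recovery step I'm running each time looks like
this (node name taken from the log below; requires Slurm admin
privileges, so this won't run outside the cluster):

```shell
# Clear the DRAIN flag so the node returns to service.
# Run as SlurmUser/root on a machine that can reach slurmctld.
scontrol update NodeName=rn003 State=RESUME
```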

Looking at the slurmctld log on the head node, I see the following relevant
entries:
[2019-03-15T09:45:12.739] Time limit exhausted for JobId=1325
[2019-03-15T09:45:44.001] _slurm_rpc_complete_job_allocation: JobID=1325 State=0x8006 NodeCnt=1 error Job/step already completing or completed
[2019-03-15T09:46:12.805] Resending TERMINATE_JOB request JobId=1325 Nodelist=rn003
[2019-03-15T09:48:43.000] update_node: node rn003 reason set to: Kill task failed
[2019-03-15T09:48:43.000] update_node: node rn003 state set to DRAINING
[2019-03-15T09:48:43.000] got (nil)
[2019-03-15T09:48:43.816] cleanup_completing: job 1325 completion process took 211 seconds
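In case it helps, this is how I've been checking the drain reason after
each occurrence (standard Slurm commands; node name as in the log):

```shell
# List all drained/down nodes with the reason slurmctld recorded.
sinfo -R

# Show the full state of the affected node, including Reason=.
scontrol show node rn003
```

Both consistently show "Kill task failed" as the reason.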

It may be worth mentioning that if I run a job as root and the job hits
its wall time, the job is killed and the node returns to idle, but if the
job is submitted by a non-root user, which is the case in our normal
workflow, the node becomes drained once the wall time is met.
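From what I can tell in the documentation, the "Kill task failed" reason
is set when slurmd cannot kill a job's processes within
UnkillableStepTimeout (the log above shows completion taking 211
seconds), so I'm wondering whether something like the following
slurm.conf fragment would help, or whether it would only mask the
underlying problem (the value here is illustrative, not what we run):

```
# slurm.conf (fragment)
# Seconds slurmd waits for a job's processes to die before declaring
# the step unkillable and draining the node. Default is 60.
UnkillableStepTimeout=240
```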

Thank you,

-- 
Eric Rosenberg

