[slurm-users] Nodes remaining in drain state once job completes
Eric.Rosenberg at stonybrook.edu
Mon Mar 18 19:40:26 UTC 2019
I've set up a few nodes in Slurm for testing and am having trouble. It
seems that once a job hits its wall time, the node it ran on enters the
completing (comp) state and then remains in the drain state until I
manually set it back to resume.
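For what it's worth, the manual recovery I'm doing after each drain looks
like this (rn003 is one of my test nodes):

```shell
# Show which nodes are down/drained and the Reason field recorded for each
sinfo -R

# Clear the drain flag and return the node to service
scontrol update NodeName=rn003 State=RESUME
```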
Looking at the slurm log on the head node, I see the following relevant entries:
[2019-03-15T09:45:12.739] Time limit exhausted for JobId=1325
[2019-03-15T09:45:44.001] _slurm_rpc_complete_job_allocation: JobID=1325
State=0x8006 NodeCnt=1 error Job/step already completing or completed
[2019-03-15T09:46:12.805] Resending TERMINATE_JOB request JobId=1325
[2019-03-15T09:48:43.000] update_node: node rn003 reason set to: Kill task
[2019-03-15T09:48:43.000] update_node: node rn003 state set to DRAINING
[2019-03-15T09:48:43.000] got (nil)
[2019-03-15T09:48:43.816] cleanup_completing: job 1325 completion process
took 211 seconds
It may be worth mentioning that if I run a job as root and it hits its
wall time, the job is killed and the node returns to idle; but if the job
is submitted as a non-root user, which is the case in our normal
workflow, the node becomes drained once the wall time is reached.