<div dir="ltr"><div>Hello, <br></div><div>I've set up a few nodes on slurm to test with and am having trouble. It
seems that once a job has met it's wall time, the node that it ran on enters the comp state then remains in the drain state until I manually set the state to resume.</div><div><br></div><div>Looking at the slurm log on the head node, I see the the following relevant entries:</div><div>
[2019-03-15T09:45:12.739] Time limit exhausted for JobId=1325
[2019-03-15T09:45:44.001] _slurm_rpc_complete_job_allocation: JobID=1325 State=0x8006 NodeCnt=1 error Job/step already completing or completed
[2019-03-15T09:46:12.805] Resending TERMINATE_JOB request JobId=1325 Nodelist=rn003
[2019-03-15T09:48:43.000] update_node: node rn003 reason set to: Kill task failed
[2019-03-15T09:48:43.000] update_node: node rn003 state set to DRAINING
[2019-03-15T09:48:43.000] got (nil)
[2019-03-15T09:48:43.816] cleanup_completing: job 1325 completion process took 211 seconds
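
For reference, by "manually set the state to resume" I mean something along the lines of:

    scontrol update NodeName=rn003 State=RESUME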

It may be worth mentioning that if I run a job as root and the job hits its wall time, the job is killed and the node returns to idle; but if the job is submitted as a non-root user, which is the case in our normal workflow, the node becomes drained once the wall time is reached.

Thank you,

-- 
Eric Rosenberg