[slurm-users] scanceling a job puts the node in a draining state
Patrick Goetz
pgoetz at math.utexas.edu
Tue Apr 25 17:12:08 UTC 2023
Hi -
This was a known bug: https://bugs.schedmd.com/show_bug.cgi?id=3941
However, the bug report says it was fixed in version 17.02.7. The problem
is we're running version 17.11.2, which should already include that fix,
yet we still appear to be hitting the same bug:
[2023-04-18T17:09:42.482] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 163837 uid 38879
[2023-04-18T17:09:42.482] email msg to siming at gmail.com: SLURM Job_id=163837 Name=clip_v3_1view_s3dis_mink_crop_075 Ended, Run time 00:37:37, CANCELLED, ExitCode 0
[2023-04-18T17:09:45.104] _slurm_rpc_submit_batch_job: JobId=163843 InitPrio=43243 usec=267
[2023-04-18T17:10:33.057] Resending TERMINATE_JOB request JobId=163837 Nodelist=dgx-4
[2023-04-18T17:10:48.244] error: slurmd error running JobId=163837 on node(s)=dgx-4: Kill task failed
[2023-04-18T17:10:48.244] drain_nodes: node dgx-4 state set to DRAIN
[2023-04-18T17:10:53.524] cleanup_completing: job 163837 completion process took 71 seconds
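In case it's relevant: as I understand it, "Kill task failed" is what
slurmd reports when it can't kill the job's processes within
UnkillableStepTimeout (60 seconds by default), and the 71-second
completion time in the log would fit that. This is how I've been checking
the node's state and the recorded drain reason (dgx-4 is the node from
the log above):

    sinfo -R                   # list drained/down nodes with the Reason slurmd recorded
    scontrol show node dgx-4   # full node record, including State= and Reason=

I'm also wondering whether raising UnkillableStepTimeout in slurm.conf
would at least stop the drains; that's just a guess on my part, though:

    # slurm.conf -- allow slurmd more time to reap the job's processes before
    # it declares the step unkillable and drains the node (seconds, default 60)
    UnkillableStepTimeout=120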
That particular node is still in a draining state a week later. Just
wondering if I'm missing something.
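For now I assume I can put the node back in service by hand once I've
confirmed nothing is actually stuck on it, along the lines of:

    scontrol update NodeName=dgx-4 State=RESUME   # clear the DRAIN flag and return the node to service

but I'd rather understand why this keeps happening.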