[slurm-users] scanceling a job puts the node in a draining state
Patrick Goetz
pgoetz at math.utexas.edu
Tue Apr 25 17:12:08 UTC 2023
Hi -
This was a known bug: https://bugs.schedmd.com/show_bug.cgi?id=3941
However, the bug report says it was fixed in version 17.02.7. The problem
is we're running version 17.11.2, which should already include that fix,
yet we still appear to be hitting the same bug:
[2023-04-18T17:09:42.482] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 163837 uid 38879
[2023-04-18T17:09:42.482] email msg to siming at gmail.com: SLURM Job_id=163837 Name=clip_v3_1view_s3dis_mink_crop_075 Ended, Run time 00:37:37, CANCELLED, ExitCode 0
[2023-04-18T17:09:45.104] _slurm_rpc_submit_batch_job: JobId=163843 InitPrio=43243 usec=267
[2023-04-18T17:10:33.057] Resending TERMINATE_JOB request JobId=163837 Nodelist=dgx-4
[2023-04-18T17:10:48.244] error: slurmd error running JobId=163837 on node(s)=dgx-4: Kill task failed
[2023-04-18T17:10:48.244] drain_nodes: node dgx-4 state set to DRAIN
[2023-04-18T17:10:53.524] cleanup_completing: job 163837 completion process took 71 seconds
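In case it's relevant: as I understand it, "Kill task failed" is what
slurmd reports when it can't kill the job's processes within
UnkillableStepTimeout (60 seconds by default), and the 71-second
completion time in the log would fit that. This is how I've been checking
the node's state and the recorded drain reason (dgx-4 is the node from
the log above):

    sinfo -R                   # list drained/down nodes with the Reason slurmd recorded
    scontrol show node dgx-4   # full node record, including State= and Reason=

I'm also wondering whether raising UnkillableStepTimeout in slurm.conf
would at least stop the drains; that's just a guess on my part, though:

    # slurm.conf -- allow slurmd more time to reap the job's processes before
    # it declares the step unkillable and drains the node (seconds, default 60)
    UnkillableStepTimeout=120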
That particular node is still in a draining state a week later. Just
wondering if I'm missing something.
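For now I assume I can put the node back in service by hand once I've
confirmed nothing is actually stuck on it, along the lines of:

    scontrol update NodeName=dgx-4 State=RESUME   # clear the DRAIN flag and return the node to service

but I'd rather understand why this keeps happening.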