[slurm-users] help with canceling or deleting a job
wagner at itc.rwth-aachen.de
Wed Sep 20 11:26:49 UTC 2023
Even after rebooting, nodes sometimes remain stuck because of "completing" jobs.
What helps then is to drain the node and resume it afterwards:
scontrol update nodename=<nodename> state=drain reason=stuck
scontrol update nodename=<nodename> state=resume
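
If this has to be done for several nodes, the two commands can be wrapped in a small shell function. This is only a sketch: the function name reset_node and the DRY_RUN variable are made up for illustration and are not part of Slurm. By default it only prints the scontrol commands, so they can be checked before anything is changed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): drain a stuck node, then resume it.
# DRY_RUN defaults to 1, so the commands are only printed.
# Set DRY_RUN=0 to actually run them (requires Slurm admin rights).
reset_node() {
    node="$1"
    reason="${2:-stuck}"   # the reason must not contain spaces in this sketch
    if [ -z "$node" ]; then
        echo "usage: reset_node <nodename> [reason]" >&2
        return 1
    fi
    for cmd in \
        "scontrol update nodename=$node state=drain reason=$reason" \
        "scontrol update nodename=$node state=resume"
    do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$cmd"
        else
            # Stop at the first failing scontrol call.
            $cmd || return 1
        fi
    done
}
```

For the node from this thread, "reset_node awn-047" prints the two commands, and "DRY_RUN=0 reset_node awn-047" runs them. If the filesystem on the node is hung, this will probably not be enough, as noted in the quoted reply below.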
On 20.09.2023 at 09:11, Ole Holm Nielsen wrote:
> On 9/20/23 01:39, Feng Zhang wrote:
>> Restarting the slurmd daemon of the compute node should work, if the
>> node is still online and normal.
> Probably not. If the filesystem used by the job is hung, the node
> must probably be rebooted, and the filesystem must be checked.
>> On Tue, Sep 19, 2023 at 8:03 AM Felix <felix at itim-cj.ro> wrote:
>>> I have a job on my system which has been running longer than its time
>>> limit, more than 4 days.
>>> 1808851 debug gridjob atlas01 CG 4-00:00:19 1 awn-047
>>> I'm trying to cancel it
>>> [@arc7-node ~]# scancel 1808851
>>> I get no message, as if the job had been canceled, but when getting
>>> information about the job, the job is still there
>>> [@arc7-node ~]# squeue | grep awn-047
>>> 1808851 debug gridjob atlas01 CG 4-00:00:19 1
>>> Can I do anything else to kill the job?