[slurm-users] help with canceling or deleting a job

Wagner, Marcus wagner at itc.rwth-aachen.de
Wed Sep 20 11:26:49 UTC 2023


Even after rebooting, nodes sometimes stay stuck because of jobs in the
"completing" (CG) state.

What helps then is to set the node down and resume it afterwards:

scontrol update nodename=<nodename> state=drain reason=stuck
scontrol update nodename=<nodename> state=resume
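
If several nodes need this, the two steps can be wrapped in a small shell
function. This is only a sketch: the function name unstick_node and the
DRY_RUN variable are made up for illustration, and the scontrol arguments
are simply the ones from the command in this message. With DRY_RUN set to
a non-empty value the commands are printed instead of executed.

```shell
#!/bin/sh
# Sketch only: unstick_node and DRY_RUN are hypothetical names, not part of Slurm.
# Applies the drain-then-resume sequence from this message to one node.
unstick_node() {
    node="$1"
    # If DRY_RUN is set and non-empty, prefix each command with "echo"
    # so it is printed rather than run; otherwise the prefix is empty.
    run=${DRY_RUN:+echo}
    $run scontrol update nodename="$node" state=drain reason=stuck
    $run scontrol update nodename="$node" state=resume
}
```

For example, after setting DRY_RUN=1, calling "unstick_node awn-047" only
prints the two scontrol commands for that node, which makes it easy to
check them before running them for real.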


Best
Marcus

On 20.09.2023 at 09:11, Ole Holm Nielsen wrote:
> On 9/20/23 01:39, Feng Zhang wrote:
>> Restarting the slurmd daemon on the compute node should work, if the
>> node is still online and working normally.
>
> Probably not.  If the filesystem used by the job is hung, the node
> will probably have to be rebooted and the filesystem checked.
>
> /Ole
>
>> On Tue, Sep 19, 2023 at 8:03 AM Felix <felix at itim-cj.ro> wrote:
>>>
>>> Hello
>>>
>>> I have a job on my system which has been running past its time limit,
>>> for more than 4 days.
>>>
>>> 1808851     debug  gridjob  atlas01 CG 4-00:00:19      1 awn-047
>>>
>>> I'm trying to cancel it:
>>>
>>> [@arc7-node ~]# scancel 1808851
>>>
>>> I get no message, as if the job had been canceled, but when I query
>>> the job, it is still there:
>>>
>>> [@arc7-node ~]# squeue | grep awn-047
>>> 1808851     debug  gridjob  atlas01 CG 4-00:00:19      1 awn-047
>>>
>>> Is there anything else I can do to kill the job?
>