[slurm-users] help with canceling or deleting a job
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue Sep 19 12:27:44 UTC 2023
On 9/19/23 13:59, Felix wrote:
> Hello
>
> I have a job on my system that has been running past its time limit, for
> more than 4 days.
>
> 1808851 debug gridjob atlas01 CG 4-00:00:19 1 awn-047
The job has state "CG" which means "Completing". The Completing status is
explained in "man sinfo".
This means that Slurm is trying to cancel the job, but it hangs for some
reason.
> I'm trying to cancel it
>
> [@arc7-node ~]# scancel 1808851
>
> I get no message, as if the job had been canceled, but when I query the
> job, it is still there:
>
> [@arc7-node ~]# squeue | grep awn-047
> 1808851 debug gridjob atlas01 CG 4-00:00:19 1 awn-047
What is your UnkillableStepTimeout parameter? The default of 60 seconds
can be changed in slurm.conf. My cluster:
$ scontrol show config | grep UnkillableStepTimeout
UnkillableStepTimeout = 126 sec
> Can I do anything else to kill the job?
It may be impossible to kill the job's processes, for example, if a
filesystem is hanging.
You may log in to the node and give the job's processes a "kill -9". Or
just reboot the node.
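
A minimal sketch of that manual cleanup, to be run as root on the compute node (awn-047 in this thread). The helper names job_pids and unkillable_pids are made up for illustration, and the sketch assumes that "scontrol listpids" prints one header line followed by rows with the PID in the first column. Processes in state "D" (uninterruptible sleep, typically stuck on I/O) do not react to "kill -9", which is the case where only a reboot helps.

```shell
#!/bin/sh
# Hypothetical helpers; nothing is executed until you call them.

# Print the PIDs that slurmd on this node associates with a job ID.
# Assumption: the first output line of "scontrol listpids" is a header.
job_pids() {
    scontrol listpids "$1" | awk 'NR > 1 { print $1 }'
}

# Read "PID STAT" lines (e.g. from `ps -o pid=,stat=`) on stdin and print
# the PIDs whose state starts with "D" (uninterruptible sleep).
unkillable_pids() {
    awk '$2 ~ /^D/ { print $1 }'
}

# Example session on the node (job ID taken from this thread):
#   pids=$(job_pids 1808851 | paste -sd, -)
#   ps -o pid=,stat= -p "$pids" | unkillable_pids   # stuck processes, if any
#   kill -9 $(job_pids 1808851)                     # affects only non-"D" processes
```

If unkillable_pids prints anything, those processes are most likely waiting on a hung filesystem, and the job will probably stay in "CG" until that is resolved or the node is rebooted.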
/Ole