We are running a Slurm cluster at version `slurm 22.05.8`. One of our users has reported that their jobs have been stuck in the completing stage for a long time. Following the Slurm Troubleshooting Guide, we found that the batch host for the jobs had indeed been removed from the cluster, perhaps without being drained first.
How do we cancel/delete these jobs?
* We have tried scancel on the batch and individual job IDs, both as the user and as SlurmUser.
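A minimal sketch of the commands in question; the job ID below is a placeholder, and the output columns vary with site configuration:

```
# List jobs stuck in the completing (CG) state, along with their batch host
squeue --states=COMPLETING --format="%i %T %B %N"

# Inspect one stuck job in detail (12345 is a placeholder job ID)
scontrol show job 12345

# Cancellation attempts, run as the owning user and as SlurmUser
scancel 12345
scancel --full 12345   # also signal the batch shell and its child processes
```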
Usually to clear jobs like this you have to reboot the node they are on. That will then force the scheduler to clear them.
-Paul Edmon-
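If the node is still in the cluster, the reboot can also be requested through Slurm itself. A minimal sketch, assuming a placeholder node name `node001` and that RebootProgram is configured in slurm.conf:

```
# Ask Slurm to reboot the node as soon as possible and resume it afterwards
scontrol reboot ASAP nextstate=RESUME node001

# Confirm the node comes back and the stuck jobs clear
sinfo --nodes=node001 --long
squeue --states=COMPLETING
```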
We have Weka filesystems on one of our clusters and saw the same thing. We discovered we had slightly misconfigured the Weka client, with the result that Weka's and Slurm's cgroups were fighting with each other, and jobs stuck in completing seemed to be the symptom. Fixing the Weka cgroup configuration improved the problem for us; I haven't heard anyone complain about it since.
Tim
-- Tim Cutts Scientific Computing Platform Lead AstraZeneca
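A minimal sketch of how one might look for this kind of cgroup conflict on a compute node; it assumes the config lives at /etc/slurm/cgroup.conf, uses a placeholder job ID, and the cgroup hierarchy differs between cgroup v1 and v2:

```
# Show Slurm's cgroup enforcement settings (path is the common default)
grep -v '^#' /etc/slurm/cgroup.conf

# List any processes still attached to a job's Slurm cgroup
# (job_12345 is a placeholder; leftover PIDs here can keep a job in CG)
find /sys/fs/cgroup -type d -path '*slurm*job_12345*' \
    -exec sh -c 'echo "$1:"; cat "$1/cgroup.procs"' _ {} \;
```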
Could you give more details about this and how you debugged it?
In our case, that node has been removed from the cluster and cannot be added back right now (it is being used for other work). What can we do in such a case?
Mark the node as "DOWN" in Slurm; this is what we do when we get jobs caught in this state (on our shared nodes, only once nothing else is running there).
Best of luck! Chris
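A minimal sketch of that approach; the node name is a placeholder, and marking a node DOWN kills anything still running on it, so check shared nodes first:

```
# Mark the (already removed) node DOWN so Slurm stops waiting on it
scontrol update NodeName=node001 State=DOWN Reason="removed from cluster, clearing stuck CG jobs"

# The stuck jobs should now leave the completing state
squeue --states=COMPLETING

# If the node is ever returned to the cluster, put it back in service
scontrol update NodeName=node001 State=RESUME
```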