[slurm-users] [External] jobs stuck in "CG" state

Florian Zillner fzillner at lenovo.com
Fri Aug 20 14:55:26 UTC 2021


scancel the job, then set the affected nodes to "down", like so: "scontrol update nodename=<nodename> state=down reason=cg", and resume them afterwards with state=resume.
However, if tasks are stuck on the node, then in most cases a reboot is needed to bring the node back in a clean state.
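
For illustration, the sequence above could be wrapped in a small shell helper. This is only a sketch: clear_cg_node and run are hypothetical names (not Slurm tools), the job ID and node name are placeholders, and DRY_RUN defaults to 1 so the commands are printed rather than executed until you have checked them.

```shell
#!/bin/sh
# Sketch only: clear_cg_node and run are hypothetical helper names.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: clear_cg_node <jobid> <nodename>
clear_cg_node() {
    jobid="$1"
    node="$2"
    # 1. Cancel the job that is stuck in CG.
    run scancel "$jobid"
    # 2. Mark the node down so slurmctld gives up on the completing job.
    run scontrol update nodename="$node" state=down reason=cg
    # 3. Return the node to service.
    run scontrol update nodename="$node" state=resume
}
```

Calling "clear_cg_node 12345 node001" in dry-run mode just prints the three commands; run it with DRY_RUN=0 as a Slurm admin to actually execute them. Keep in mind that, as far as I know, setting a node to down also terminates any other jobs still running on it, and this does not help if processes are really hung on the node (see the reboot remark above).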

From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Durai Arasan <arasan.durai at gmail.com>
Sent: Friday, 20 August 2021 10:31
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: [External][slurm-users] jobs stuck in "CG" state


We have a huge number of jobs stuck in CG state from a user who probably wrote code with bad I/O. "scancel" does not make them go away. Is there a way for admins to get rid of these jobs without draining and rebooting the nodes? I read somewhere that killing the respective slurmstepd process will do the job. Is this possible? Are there any other solutions? Also, are there any parameters in slurm.conf that one can set to manage such situations better?

MPI Tübingen