I have had something similar. The fix was to run scontrol reconfig, which causes a reread of the slurmd config. Give that a try.
It might be scontrol reconfigure on your version; check the manual to be sure.
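For reference, roughly what that looks like on the command line (exact subcommand spelling may vary slightly between Slurm releases):

    # ask slurmctld and the slurmd daemons to re-read slurm.conf
    scontrol reconfigure

    # the man page lists the exact subcommands available on your version
    man scontrol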
On Mon, Feb 10, 2025, 8:32 AM Ricardo Román-Brenes via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hello everyone.
I have a cluster of 16 nodes, 4 of which have GPUs, with no particular configuration to manage them. The filesystem is GlusterFS, and authentication is via slapd/munge.
My problem is that very frequently, at least one job daily, a job gets stuck in the CG (completing) state. I have no idea why this happens. Manually killing the slurmstepd process releases the node, but that is in no way a manageable solution. Has anyone experienced this (and fixed it)?
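To illustrate the workaround (the node name and PID below are placeholders):

    # list jobs stuck in the completing (CG) state
    squeue --states=COMPLETING

    # on the affected node, find the lingering step daemon and kill it by hand
    pgrep -a slurmstepd
    kill -9 <pid>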
Thank you.
-Ricardo
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com