[slurm-users] Re: jobs getting stuck in CG

10 Feb 2025


I observed similar symptoms when we had issues with our shared Lustre 
file system. When the file system couldn't complete an I/O operation, 
the job in Slurm remained in the CG (completing) state until the file 
system became responsive again. An additional symptom was that the 
blocking process was stuck in the D (uninterruptible sleep) state.
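
To check for the same symptom on a node that holds a job in CG, something like the following might help. It is only a minimal sketch, assuming a Linux node with /proc mounted; the function names parse_stat and find_d_state are made up for illustration and are not part of Slurm. It lists processes whose state field in /proc/<pid>/stat is D.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def find_d_state(proc_root="/proc"):
    """Return a list of (pid, comm) for processes currently in the D state."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited in the meantime, or stat was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

A process can pass through the D state briefly during normal I/O, so a single hit is not conclusive; a process that stays in D across repeated runs is the more telling sign. On most systems `ps -eo pid,stat,wchan:32,cmd` shows the same state information without any script.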
On 10/02/2025 09:28, Ricardo Román-Brenes via slurm-users wrote:
...
Hello everyone.
I have a cluster composed of 16 nodes, with 4 of them having GPUs with 
no particular configuration to manage them.
The filesystem is gluster, authentication via slapd/munge.
My problem is that very frequently, at least once a day, a job 
gets stuck in CG. I have no idea why this happens. Manually killing 
the slurmstepd process releases the node, but this is in no way a 
manageable solution. Has anyone experienced this (and fixed it)?
Thank you.
-Ricardo
-- 
best regards
*Michał Kadlof*
Head of the high performance computing center
Eden^N cluster administrator
Structural and Functional Genomics Laboratory
Faculty of Mathematics and Computer Science
Warsaw University of Technology
