I observed similar symptoms when we had issues with our shared Lustre file system. When the file system couldn't complete an I/O operation, the job in Slurm remained in the CG (completing) state until the file system became responsive again. An additional symptom was that the blocking process was stuck in the D (uninterruptible sleep) state.
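In case it helps with checking for that second symptom, here is a minimal sketch that lists processes in the D state on a compute node by reading /proc/<pid>/stat. It assumes a Linux node with a standard /proc layout; the function names (parse_stat, find_d_state) are just illustrative, not part of Slurm or any library.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' instead of splitting on whitespace.
    left, _, right = stat_text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_str), comm, state


def find_d_state(proc_root="/proc"):
    """List (pid, comm) for processes currently in uninterruptible sleep."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited in the meantime, or the file is unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

Run it on the node that holds the job in CG. A process that shows up as D on every run, rather than only occasionally, is probably waiting on I/O that never completes, which would point to the file system rather than to Slurm itself.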
Hello everyone.
I have a cluster composed of 16 nodes, 4 of which have GPUs; there is no particular configuration to manage them. The file system is Gluster, and authentication is via slapd/munge.
My problem is that very frequently, at least once a day, a job gets stuck in CG. I have no idea why this happens. Manually killing the slurmstepd process releases the node, but this is not a manageable solution. Has anyone experienced this (and fixed it)?
Thank you.
-Ricardo
Head of the High Performance Computing Center
EdenN cluster administrator
Structural and Functional Genomics Laboratory
Faculty of Mathematics and Computer Science
Warsaw University of Technology