10 Feb
2025
8:28 a.m.
Hello everyone. I have a 16-node cluster; 4 of the nodes have GPUs, with no particular configuration to manage them. The filesystem is Gluster, and authentication is via slapd/munge. My problem is that very frequently, at least once a day, a job gets stuck in the CG (completing) state. I have no idea why this happens. Manually killing the slurmstepd process releases the node, but this is in no way a manageable solution. Has anyone experienced this (and fixed it)? Thank you. -Ricardo
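
For anyone trying to spot the same symptom, here is a minimal sketch (not a fix) that lists jobs currently in the COMPLETING state along with their nodes. It assumes `squeue` is on the PATH of the machine where it runs; the function names `parse_completing` and `list_completing_jobs` and the `|` separator are just illustrative choices, not anything Slurm provides.

```python
import subprocess


def parse_completing(squeue_output):
    """Turn lines of 'jobid|nodelist' into a list of (jobid, nodelist) tuples."""
    jobs = []
    for line in squeue_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        job_id, _, nodes = line.partition("|")
        jobs.append((job_id, nodes))
    return jobs


def list_completing_jobs():
    """Ask squeue for jobs in the COMPLETING (CG) state.

    %i is the job ID and %N is the node list; the '|' separator
    is an arbitrary choice made here to keep parsing simple.
    """
    result = subprocess.run(
        ["squeue", "--noheader", "--states=COMPLETING", "--format=%i|%N"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_completing(result.stdout)
```

Calling `list_completing_jobs()` periodically (from cron, say) and logging the result would at least show which nodes the stuck jobs land on, which may help narrow down whether the GPU nodes or the Gluster mounts are involved.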