Hello everyone.
I have a cluster of 16 nodes, 4 of which have GPUs with no particular configuration to manage them. The filesystem is GlusterFS, and authentication is via slapd/munge.
My problem is that very frequently, at least one job per day, a job gets stuck in the CG (completing) state. I have no idea why this happens. Manually killing the slurmstepd process releases the node, but that is in no way a manageable solution. Has anyone experienced this (and fixed it)?
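For reference, this is roughly what I do today to clear it; the node name and PID below are just placeholders:

    squeue --states=CG                          # jobs stuck completing
    ssh node04 'pgrep -af slurmstepd'           # find the lingering step daemon on that node
    ssh node04 'sudo kill -9 <slurmstepd_pid>'  # killing it by hand releases the node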
Thank you.
-Ricardo
I have had something similar. The fix was to run scontrol reconfig, which causes slurmd to reread its config. Give that a try.
It might be scontrol reread; check the manual.
Belay that reply; that was a different issue. In that case salloc works OK but srun says the user has no job on the node.
On Mon, Feb 10, 2025, 9:24 AM John Hearns hearnsj@gmail.com wrote:
I have had something similar. The fix was to run scontrol reconfig, which causes slurmd to reread its config. Give that a try.
It might be scontrol reread; check the manual.
I observed similar symptoms when we had issues with the shared Lustre file system. When the file system couldn't complete an I/O operation, the process in Slurm remained in the CG state until the file system became responsive again. An additional symptom was that the blocking process was stuck in the D state.
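A quick way to confirm that pattern on an affected node is to look for processes in uninterruptible sleep and see where they are blocked, for example:

    # processes in D state and the kernel function they are blocked in
    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

If those show up blocked in filesystem or network I/O, the stuck CG job is usually a storage problem rather than a Slurm one.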
ps -eaf --forest is your friend with Slurm
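For example, to jump straight to the step daemons and whatever they have spawned (the grep pattern and context length are just illustrative):

    ps -eaf --forest | grep -A 10 '[s]lurmstepd'   # each slurmstepd plus the tree below it

or simply page through the full output of ps -eaf --forest and look for the job's processes under slurmd.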
On 2/10/25 7:05 am, Michał Kadlof via slurm-users wrote:
I observed similar symptoms when we had issues with the shared Lustre file system. When the file system couldn't complete an I/O operation, the process in Slurm remained in the CG state until the file system became responsive again. An additional symptom was that the blocking process was stuck in the D state.
We've seen the same behaviour, though in our case we use an "UnkillableStepProgram" to deal with compute nodes where user processes (as opposed to Slurm daemons, which sounds like the issue for the original poster here) get stuck and are unkillable.
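For anyone who hasn't used it, it's wired up in slurm.conf roughly like this (the path and timeout here are only examples; check the slurm.conf man page for your version):

    # slurm.conf -- run a site script when a completing step's processes cannot be killed
    UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh   # example path to the site script
    UnkillableStepTimeout=180                                  # example: seconds to wait after SIGKILL before the script runs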
Our script does things like "echo w > /proc/sysrq-trigger" to get the kernel to dump its view of all blocked processes, and then it goes through the stuck job's cgroup to find all of the processes and dumps /proc/$PID/stack for each process and thread it finds there.
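A simplified sketch of that diagnostic pass, assuming a cgroup v1 freezer hierarchy and that SLURM_JOB_ID is in the program's environment (adjust the paths for your cgroup setup); in practice the output goes to a log rather than stdout:

    #!/bin/bash
    # Illustrative UnkillableStepProgram diagnostic pass -- paths are examples only
    echo w > /proc/sysrq-trigger    # kernel logs all blocked (D state) tasks (needs sysrq enabled)

    # Walk the stuck job's cgroup and dump the kernel stack of every task in it
    for cg in /sys/fs/cgroup/freezer/slurm/uid_*/job_"${SLURM_JOB_ID}"; do
        for pid in $(find "${cg}" -name cgroup.procs -exec cat {} + 2>/dev/null | sort -u); do
            echo "=== PID ${pid} ($(cat /proc/${pid}/comm 2>/dev/null)) ==="
            for task in /proc/${pid}/task/*; do
                echo "--- TID $(basename "${task}") ---"
                cat "${task}"/stack 2>/dev/null    # kernel stack of each thread; needs root
            done
        done
    done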
In the end it either marks the node down (if it's the only job on the node, which will mark the job as complete in Slurm, though it will not free up those stuck processes) or drains the node if it's running multiple jobs. In both cases we'll come back and check the issue out (and our SREs will wake us up if they think there's an unusual number of these).
That final step is important because a node stuck completing can really confuse backfill scheduling for us: slurmctld assumes it will become free any second now and tries to use the node when planning jobs, despite it being stuck. So marking it down or drained gets it out of slurmctld's view as a potential future node.
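That last part is just scontrol from the same script, something like the following (assuming the program runs on the affected node, so hostname -s names it correctly; the reason strings are only examples):

    # only job on the node: take the node out entirely
    scontrol update NodeName="$(hostname -s)" State=DOWN  Reason="unkillable step in job ${SLURM_JOB_ID}"
    # other jobs still running: stop new work but let them finish
    scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="unkillable step in job ${SLURM_JOB_ID}"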
For nodes where a Slurm daemon on the node is stuck, that script will not fire, so our SREs have alarms that trip after a node has been completing for longer than a certain amount of time. They go and look at what's going on and get the node out of the system before utilisation collapses (and wake us up if that number seems to be increasing).
All the best, Chris