GPU fallen off the bus
Hello, We are noticing that some of the GPUs on a specific node have "fallen off the bus". We would like to remove a specific GPU from the Slurm scheduler. For example, if GPU0 has fallen off the bus, we need the remaining GPUs (GPU1-8) to stay available while GPU0 can no longer be allocated. How can we achieve that? I have read about blacklisting on the Slurm forum, but there seems to be no satisfying solution. Best, Fritz Ratnasamy, Data Scientist, Information Technology
Hi Fritz, On 27/05/2026 06:46, Ratnasamy, Fritz via slurm-users wrote:
> We are noticing that some of the GPUs on a specific node have "fallen off the bus". We would like to remove this specific GPU from the Slurm scheduler. For example, let's say GPU0 has fallen off the bus, we would need the rest of the GPU1-8 to be available and make GPU0 not able to be allocated. How can we achieve that? I have read about blacklist on the Slurm forum but it seems there is no satisfying solution.
We asked the same question recently. See: https://support.schedmd.com/show_bug.cgi?id=25180 and https://support.schedmd.com/show_bug.cgi?id=25181. Our current, cumbersome method is to drop the 'broken' GPU off the PCIe bus and reconfigure Slurm with one fewer GPU. Ward
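For readers who want to picture that workaround, here is a rough dry-run sketch in Python. It is not taken from the thread. It only builds a list of planned steps and executes nothing. The node name `node01`, the GPU count, and the PCI address are hypothetical, and the kernel log wording matched by the regular expression is an assumption that should be checked against your own `dmesg` output. Note that device numbering under `/dev/nvidia*` may shift once a GPU is removed, so `gres.conf` likely needs a manual review as well.

```python
import re
from typing import List

# Assumed shape of the NVIDIA kernel message, e.g.
# "NVRM: GPU 0000:3b:00.0: GPU has fallen off the bus."
# Verify the exact wording on your own nodes before relying on it.
FALLEN_RE = re.compile(
    r"NVRM: GPU (?P<addr>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"GPU has fallen off the bus"
)


def fallen_gpus(kernel_log: str) -> List[str]:
    """Return unique PCI addresses reported as fallen off the bus, in log order."""
    seen: List[str] = []
    for match in FALLEN_RE.finditer(kernel_log):
        addr = match.group("addr").lower()
        if addr not in seen:
            seen.append(addr)
    return seen


def plan_removal(node: str, total_gpus: int, bad_addrs: List[str]) -> List[str]:
    """Build (but never run) the steps for dropping GPUs and shrinking the GPU count."""
    remaining = total_gpus - len(bad_addrs)
    if remaining < 0:
        raise ValueError("more failed GPUs than GPUs configured on the node")
    # Drain first so no new jobs land on the node while it is being changed.
    steps = [f"scontrol update NodeName={node} State=DRAIN Reason=gpu_fallen_off_bus"]
    for addr in bad_addrs:
        # Writing 1 to the sysfs 'remove' file detaches the device from the PCI bus.
        steps.append(f"echo 1 > /sys/bus/pci/devices/{addr}/remove")
    # The config change is left as a manual step; file layout differs per site.
    steps.append(f"edit slurm.conf and gres.conf: set Gres=gpu:{remaining} for {node}")
    steps.append("restart slurmd on the node and reconfigure the controller, then resume the node")
    return steps


if __name__ == "__main__":
    sample_log = "NVRM: GPU 0000:3b:00.0: GPU has fallen off the bus.\n"
    for step in plan_removal("node01", 8, fallen_gpus(sample_log)):
        print(step)
```

Keeping this as a dry run is deliberate: removing a PCI device and changing a node's GPU count are disruptive, so an admin would probably want to review each step before running anything.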
Hello, I don't have a solution to offer, and this might be a bit off-topic, but I just wanted to say that I'm glad to know we are not the only ones dealing with NVIDIA issues. We are experiencing this "fallen off the bus" error on ~20% of our brand-new nodes (especially the H200 provided by Lenovo). Finding an easy workaround using Slurm would be great, but it's a shame that NVIDIA is experiencing these issues. I've never seen such a high failure rate on hardware this expensive. </offtopic> Regards, -- Bartomeu Miró Mateu Centre de Supercomputació i Intel·ligència Artificial de les Illes Balears Àrea de Suport Experimental i Serveis Cientificotècnics Universitat de les Illes Balears https://bsai.uib.cat/
participants (3)
- Bartomeu Miró Mateu
- Ratnasamy, Fritz
- Ward Poelmans