27 May
2026
11:29 a.m.
Hello,

I don't have a solution to offer, and this might be a bit off-topic, but I just wanted to say that I'm glad to know we are not the only ones dealing with NVIDIA issues. We are experiencing this "fallen off the bus" error on ~20% of our brand-new nodes (especially the H200 provided by Lenovo).

Finding an easy workaround using Slurm would be great, but it's a shame that NVIDIA is experiencing these issues. I've never seen such a high failure rate on hardware this expensive.

</offtopic>

Regards,

--
Bartomeu Miró Mateu
Centre de Supercomputació i Intel·ligència Artificial de les Illes Balears
Àrea de Suport Experimental i Serveis Cientificotècnics
Universitat de les Illes Balears
https://bsai.uib.cat/