Hello,
at least for NVIDIA GPUs, we have Node Health Check (NHC) examine the dcgmi health output - so we have health watchers set on the GPUs, and if dcgmi reports errors, NHC drains the node. We're trying to do something similar for our AMD GPUs, but there doesn't seem to be an equivalent 'live' health check there, so on those nodes we periodically run a diagnostics script and check its output as part of NHC.
We've also found failure conditions on some of our GPU nodes that the dcgmi health watchers don't pick up, and have implemented separate checks for those (again, added to the NHC script).
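For anyone curious what such a check looks like: NHC checks are just bash functions, so something along these lines works. This is only a minimal sketch - the function name, the dcgmi invocation, and the error patterns are assumptions, not our actual site config, and you'd want to match the patterns to the dcgmi version you run.

```shell
#!/bin/bash
# Hypothetical NHC-style check: run `dcgmi health --check` and fail the
# node if the output mentions an error. The output can also be passed in
# as $1, which makes the parsing logic easy to test without a GPU.
check_gpu_dcgm_health() {
    local out
    # Use injected output if given, otherwise query DCGM for group 0.
    out="${1:-$(dcgmi health --check -g 0 2>&1)}"
    if printf '%s\n' "$out" | grep -qiE 'error|fail'; then
        echo "ERROR: dcgmi reports a GPU health problem"
        return 1   # non-zero return tells NHC to mark the node bad
    fi
    return 0
}
```

The same pattern covers the AMD case: swap the dcgmi call for your periodic diagnostics script and adjust the patterns it greps for.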
My opinion is that it's always better to have the HealthCheckProgram pick up on errors, rather than rely on 'manual' discovery.
We don't do anything special about jobs on the nodes - if a GPU dies mid-job, the job(s) using that GPU will likely fail anyway, and the node goes into the drain state, so...
Tina
On 15/03/2026 03:46, Antonio Jose Alonso-Stepanov via slurm-users wrote:
Hi all,
I'm a Stanford CS student looking into how sites handle GPU node failures during long-running jobs. A couple questions:
When a GPU node goes down mid-job, do most sites use Slurm's requeue or --no-kill to handle it, or is it mostly manual drain and resubmit?
Is anyone using HealthCheckProgram to catch GPU issues (like ECC errors via DCGM), or do you handle GPU health monitoring outside of Slurm?
Curious what's worked and what hasn't. Thanks.
Antonio