Hello,
at least for NVIDIA GPUs, we have Node Health Check (NHC) examine the dcgmi health output - so we have health watchers set on the GPUs, and if dcgmi reports errors, NHC drains the node. We're trying to do something similar for our AMD GPUs, but there doesn't seem to be an equivalent 'live' health check there, so on those nodes we periodically run a diagnostics script and check its output as part of NHC.
We've also found failure conditions on some of our GPU nodes that the dcgmi health watchers don't pick up, and have implemented separate checks for those (again, added to the NHC script).
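For anyone curious what such a check looks like: NHC checks are just bash functions, so something along these lines works. This is only a minimal sketch - the function name, the dcgmi invocation, and the error patterns are assumptions, not our actual site config, and you'd want to match the patterns to the dcgmi version you run.

```shell
#!/bin/bash
# Hypothetical NHC-style check: run `dcgmi health --check` and fail the
# node if the output mentions an error. The output can also be passed in
# as $1, which makes the parsing logic easy to test without a GPU.
check_gpu_dcgm_health() {
    local out
    # Use injected output if given, otherwise query DCGM for group 0.
    out="${1:-$(dcgmi health --check -g 0 2>&1)}"
    if printf '%s\n' "$out" | grep -qiE 'error|fail'; then
        echo "ERROR: dcgmi reports a GPU health problem"
        return 1   # non-zero return tells NHC to mark the node bad
    fi
    return 0
}
```

The same pattern covers the AMD case: swap the dcgmi call for your periodic diagnostics script and adjust the patterns it greps for.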
My opinion is that it's always better to have the HealthCheckProgram pick up on errors, rather than rely on 'manual' discovery.
We don't do anything special about jobs on the nodes - if a GPU dies mid-job, the job(s) using that GPU will likely fail anyway, and the node goes into the drain state, so...
Tina
On 15/03/2026 03:46, Antonio Jose Alonso-Stepanov via slurm-users wrote:
Hi all,
I'm a Stanford CS student looking into how sites handle GPU node failures during long-running jobs. A couple questions:
When a GPU node goes down mid-job, do most sites use Slurm's requeue or --no-kill to handle it, or is it mostly manual drain and resubmit?
Is anyone using HealthCheckProgram to catch GPU issues (like ECC errors via DCGM), or do you handle GPU health monitoring outside of Slurm?
Curious what's worked and what hasn't. Thanks.
Antonio