[slurm-users] Re: How do you handle GPU node failures during long jobs?

26 Mar 2026


On 3/14/26 11:46 pm, Antonio Jose Alonso-Stepanov via slurm-users wrote:

...
> When a GPU node goes down mid-job, do most sites use Slurm's requeue or
> --no-kill to handle it, or is it mostly manual drain and resubmit?

We leave it to our users to decide how best to deal with that.
...
> Is anyone using HealthCheckProgram to catch GPU issues (like ECC errors
> via DCGM), or do you handle GPU health monitoring outside of Slurm?

We run non-intrusive checks via the health check script (for instance,
dumping the XML and parsing it for problems). If we find any, we either
drain the node (if it's a hardware issue that needs attention) or queue
it for a reboot with "scontrol reboot" (if it's just a remap issue).
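
A minimal sketch of this kind of non-intrusive check, not the actual site script. It assumes the XML comes from "nvidia-smi -q -x" and that the row-remapping element names below match your driver's output (check this before relying on it). The function names, the node name and the reason strings are all hypothetical. It only builds the scontrol command; running it is left to the caller.

```python
import xml.etree.ElementTree as ET

# Assumed shape of "nvidia-smi -q -x" output, trimmed to the fields used here.
SAMPLE_XML = """<nvidia_smi_log>
  <gpu id="00000000:17:00.0">
    <remapped_rows>
      <remapped_row_pending>No</remapped_row_pending>
      <remapped_row_failure>No</remapped_row_failure>
    </remapped_rows>
  </gpu>
  <gpu id="00000000:65:00.0">
    <remapped_rows>
      <remapped_row_pending>Yes</remapped_row_pending>
      <remapped_row_failure>No</remapped_row_failure>
    </remapped_rows>
  </gpu>
</nvidia_smi_log>"""


def _flag(gpu, path):
    """Return True if the element at `path` holds the text 'Yes'."""
    return (gpu.findtext(path) or "").strip().lower() == "yes"


def classify_gpu(gpu):
    """Map one <gpu> element to 'drain', 'reboot' or 'ok'."""
    if _flag(gpu, "remapped_rows/remapped_row_failure"):
        return "drain"   # remapping failed: hardware needs attention
    if _flag(gpu, "remapped_rows/remapped_row_pending"):
        return "reboot"  # remap is pending and takes effect after a reset
    return "ok"


def node_action(xml_text):
    """Return the most severe action needed across all GPUs on the node."""
    root = ET.fromstring(xml_text)
    actions = {classify_gpu(gpu) for gpu in root.iter("gpu")}
    for action in ("drain", "reboot"):
        if action in actions:
            return action
    return "ok"


def scontrol_command(action, node):
    """Build (but do not run) the scontrol command for an action."""
    if action == "drain":
        return ["scontrol", "update", f"NodeName={node}",
                "State=DRAIN", "Reason=GPU row remap failure"]
    if action == "reboot":
        return ["scontrol", "reboot", "ASAP", "nextstate=RESUME",
                "reason=GPU row remap pending", node]
    return []
```

In a real health check, the XML would come from running nvidia-smi on the node, and the resulting command would be passed to subprocess.run.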

In the job epilog we run any GPU tests that use resources on the GPU
(e.g. "dcgmi diag -r 1"), and if we find a problem we fail the epilog to
drain the node.
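
A minimal sketch of the epilog step, again not the actual site script. It assumes that dcgmi exits non-zero when a diagnostic fails (some versions may need their output parsed instead) and relies on Slurm draining a node whose Epilog exits non-zero. The function name run_epilog, the injectable run parameter and the timeout value are hypothetical choices for illustration.

```python
import subprocess
import sys


def run_epilog(run=subprocess.run, timeout=300):
    """Run a quick DCGM diagnostic; return the exit code for the epilog.

    `run` is injectable so the logic can be exercised without a GPU.
    """
    try:
        result = run(["dcgmi", "diag", "-r", "1"],
                     capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Assumption: a node we cannot test is treated as failed.
        print(f"dcgmi diag could not be run: {exc}", file=sys.stderr)
        return 1

    if result.returncode != 0:
        # A non-zero exit from the epilog makes Slurm drain the node.
        print(f"dcgmi diag failed:\n{result.stdout}", file=sys.stderr)
        return 1
    return 0
```

An epilog script built on this would end with sys.exit(run_epilog()). Treating a missing or hung dcgmi as a failure is a conservative choice; a site could equally decide to log it and return 0.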
All the best,
Chris
-- 
Chris Samuel  :  http://www.csamuel.org/  :  Philadelphia, PA, USA
