On 3/14/26 11:46 pm, Antonio Jose Alonso-Stepanov via slurm-users wrote:
> When a GPU node goes down mid-job, do most sites use Slurm's requeue or --no-kill to handle it, or is it mostly manual drain and resubmit?
We leave it to our users to decide how best to deal with that.
> Is anyone using HealthCheckProgram to catch GPU issues (like ECC errors via DCGM), or do you handle GPU health monitoring outside of Slurm?
We run non-intrusive checks via the health check script (for instance, dumping the XML and parsing it for problems). If we find any, we either drain the node (if it's a hardware issue that needs attention) or queue it for a reboot with "scontrol reboot" (if it's just a remap issue).
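As a rough illustration of that kind of check (not our actual script), here is a minimal Python sketch. It assumes the XML comes from "nvidia-smi -q -x" and that the row remapping state appears under a remapped_rows element with remapped_row_pending and remapped_row_failure children; those element names, the function names and the reason string are assumptions for the example and may not match every driver version. It treats a remapping failure as a hardware issue (drain) and a pending remap as something a reboot clears.

```python
import xml.etree.ElementTree as ET


def classify_gpus(xml_text):
    """Return "drain", "reboot" or "ok" for one node's GPU XML dump.

    Assumes the layout of `nvidia-smi -q -x` output; the element names
    below are assumptions and should be checked against your driver.
    """
    root = ET.fromstring(xml_text)
    verdict = "ok"
    for gpu in root.iter("gpu"):
        rows = gpu.find("remapped_rows")
        if rows is None:
            continue  # GPU does not report row remapping
        failure = (rows.findtext("remapped_row_failure") or "").strip().lower()
        pending = (rows.findtext("remapped_row_pending") or "").strip().lower()
        if failure == "yes":
            return "drain"  # treated here as hardware needing attention
        if pending == "yes":
            verdict = "reboot"  # a reboot applies the pending remap
    return verdict


def scontrol_command(node, verdict):
    """Build the scontrol argv for a verdict, or None if nothing to do."""
    if verdict == "drain":
        return ["scontrol", "update", f"NodeName={node}",
                "State=DRAIN", "Reason=GPU health check failed"]
    if verdict == "reboot":
        return ["scontrol", "reboot", node]
    return None
```

A wrapper run from HealthCheckProgram would feed the XML dump to classify_gpus() and hand the resulting argv to subprocess.run(). Keeping the classification separate from the scontrol call makes it easy to test against saved XML dumps without touching a live node.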
In the job epilog we run any GPU tests that use resources on the GPU (e.g. "dcgmi diag -r 1"), and if we find a problem we fail the epilog to drain the node.
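The epilog side can be sketched along the same lines. This is a minimal example, not our production epilog: it assumes that "dcgmi diag -r 1" exits non-zero when the diagnostic finds a problem, and the function name, the 300 second timeout and the injectable run parameter are all made up for illustration. It relies on Slurm draining a node whose epilog exits non-zero.

```python
import subprocess

# The quick DCGM diagnostic mentioned above.
DIAG_CMD = ["dcgmi", "diag", "-r", "1"]


def epilog_exit_code(run=subprocess.run):
    """Return the exit code the epilog should finish with.

    0 means the GPUs passed; 1 makes the epilog fail so that Slurm
    drains the node. `run` is injectable so the logic can be tested
    without DCGM installed (hypothetical convenience, not a Slurm API).
    """
    try:
        # Timeout of 300 s is an arbitrary assumption for this sketch.
        result = run(DIAG_CMD, capture_output=True, text=True, timeout=300)
    except (OSError, subprocess.TimeoutExpired):
        # dcgmi missing or hung: treat as a failure in this sketch.
        return 1
    # Assumption: a non-zero return code means the diagnostic failed.
    return 0 if result.returncode == 0 else 1
```

An epilog script would end with sys.exit(epilog_exit_code()). Whether a missing or hung dcgmi should drain the node is a site decision; the sketch errs on the side of draining.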
All the best, Chris