Hi all,
I'm a Stanford CS student looking into how sites handle GPU node failures during long-running jobs. A couple of questions:
When a GPU node goes down mid-job, do most sites use Slurm's requeue or --no-kill to handle it, or is it mostly manual drain and resubmit?
Is anyone using HealthCheckProgram to catch GPU issues (like ECC errors via DCGM), or do you handle GPU health monitoring outside of Slurm?
Curious what's worked and what hasn't. Thanks.
Antonio