[slurm-users] How do you handle GPU node failures during long jobs?