[slurm-users] Increasing SlurmdTimeout beyond 300 Seconds

12 Feb 2024


Hi,
We've been experiencing issues with network saturation on our older nodes caused by storage (GPFS) backups. This causes slurmctld to lose contact with slurmd on some compute nodes, and user jobs are killed as a result. While the longer-term solution is to replace these nodes and upgrade the network, I'm wondering whether increasing SlurmdTimeout has any ramifications beyond nodes with genuine issues taking longer to be marked down. We've already applied a modest increase, which has helped but not resolved the issue, and we're wondering whether we should push it further in the interim.
Kind regards,
Andy Baughan
HPC Systems Developer
