Hi,
We’ve been experiencing network saturation on our older nodes caused by storage (GPFS) backups. This causes slurmctld to lose contact with slurmd on some compute nodes, and user jobs are killed as a result. While the longer-term solution
is to replace these nodes and upgrade the network, I’m wondering whether increasing SlurmdTimeout has any ramifications beyond nodes with genuine issues taking longer to be marked down. We’ve already applied a modest increase, which has helped but not resolved
the issue, and we are wondering whether we should push it further in the interim.
Kind Regards
Andy Baughan
HPC Systems Developer