Hi,

We’ve been experiencing network saturation on our older nodes, caused by storage (GPFS) backups. This causes slurmctld to lose contact with slurmd on some compute nodes, and user jobs are killed as a result. While the longer-term solution is to replace these nodes and upgrade the network, I’m wondering whether increasing SlurmdTimeout has any ramifications beyond nodes with genuine issues taking longer to be marked down. We’ve already applied a modest increase, which has helped but not resolved the issue, and we’re wondering whether we should push it further in the interim.

Kind regards,

Andy Baughan

HPC Systems Developer