Hi,
We've been experiencing network saturation on our older nodes caused by storage (GPFS) backups. This causes slurmctld to lose contact with slurmd on some compute nodes and user jobs to be killed. While the longer-term solution is to replace these nodes and upgrade the network, I'm wondering whether increasing SlurmdTimeout has any ramifications beyond nodes with genuine issues taking longer to be marked down. We've already applied a modest increase, which has helped but not resolved the issue, and we're wondering whether we should push it further in the interim.
Kind Regards
Andy Baughan
HPC Systems Developer
We've been running one cluster with SlurmdTimeout = 1200 sec for a couple of years now, and I haven't seen any problems due to that.
We set SlurmdTimeout=600. The docs say not to go any higher than 65533 seconds:
https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout
The FAQ also has information about SlurmdTimeout. The worst that could happen is that it will take longer for nodes to be set DOWN:
A node is set DOWN when the slurmd daemon on it stops responding for SlurmdTimeout as defined in slurm.conf.
https://slurm.schedmd.com/faq.html
I wouldn't set it too high, but what counts as too high or too low will vary from site to site, depending on how busy your controllers and your network are.
Regards
--Mick
________________________________
From: Bjørn-Helge Mevik via slurm-users <slurm-users@lists.schedmd.com>
Sent: Monday, February 12, 2024 7:16 AM
To: slurm-users@schedmd.com <slurm-users@schedmd.com>
Subject: [slurm-users] Re: Increasing SlurmdTimeout beyond 300 Seconds
We've been running one cluster with SlurmdTimeout = 1200 sec for a couple of years now, and I haven't seen any problems due to that.
-- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo
We'd bumped ours up for a while 20+ years ago when we had a flaky network connection between two buildings holding our compute nodes. If you need more than 600s you have networking problems.
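For anyone wanting to sanity-check a setting against the figures mentioned in this thread, here is a minimal Python sketch. It is not part of Slurm: the function names, the verdict strings, and the simplified parsing are all hypothetical, and the 600 s "soft" threshold is only the rule of thumb quoted above, not a documented limit. The 65533 s cap and the 300 s default are taken from the slurm.conf documentation linked earlier.

```python
import re

# Figures taken from this thread / the linked slurm.conf documentation.
DEFAULT_SLURMD_TIMEOUT = 300   # documented default, in seconds
MAX_SLURMD_TIMEOUT = 65533     # documented upper bound, in seconds
SOFT_THRESHOLD = 600           # rule of thumb from this thread, NOT a Slurm limit

# Matches both "SlurmdTimeout=600" (slurm.conf style) and
# "SlurmdTimeout = 1200 sec" (the style used earlier in this thread).
_PATTERN = re.compile(r"^\s*SlurmdTimeout\s*=\s*(\d+)", re.IGNORECASE)


def read_slurmd_timeout(conf_text):
    """Return the first SlurmdTimeout value found in conf_text.

    Simplified, hypothetical parser: it ignores everything after '#'
    and falls back to the documented default if the key is absent.
    How Slurm itself handles duplicate keys is not modelled here.
    """
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]  # drop comments
        match = _PATTERN.match(line)
        if match:
            return int(match.group(1))
    return DEFAULT_SLURMD_TIMEOUT


def check_slurmd_timeout(seconds):
    """Return a short verdict string (hypothetical wording) for a value."""
    if seconds > MAX_SLURMD_TIMEOUT:
        return "invalid: exceeds documented maximum"
    if seconds > SOFT_THRESHOLD:
        return "allowed, but may point to a network problem"
    return "ok"


if __name__ == "__main__":
    sample = "SlurmctldTimeout=120\nSlurmdTimeout=1200  # raised temporarily\n"
    value = read_slurmd_timeout(sample)
    print(value, "->", check_slurmd_timeout(value))
```

The same pattern also accepts a line copied from `scontrol show config` output, so the check can be run against either the file or the live configuration.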
On Mon, Feb 12, 2024 at 5:41 PM Timony, Mick via slurm-users <slurm-users@lists.schedmd.com> wrote: