On 5/4/2026 5:33 PM, Pharthiphan Asokan via slurm-users wrote:
We’re seeing an issue where jobs submitted via |salloc| are automatically cancelled when a compute node becomes temporarily unreachable. Our goal is to keep jobs pending or requeued, instead of being cancelled outright, when a node drops offline briefly. Slurm sometimes cancels the job rather than requeuing it when the node is marked |DOWN/ DRAIN/ DRAINING*|.
Is there a recommended configuration or additional parameter that ensures jobs remain pending/requeued until the node returns, rather than being cancelled?
The DOWN node state is very likely caused (rightly so, IMHO) by the SlurmdTimeout parameter in slurm.conf:
The interval, in seconds, that the Slurm controller waits for slurmd to respond before configuring that node's state to DOWN. See https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout
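As a rough sketch only: a site that wants to tolerate longer slurmd outages before a node is set to DOWN could raise this timeout in slurm.conf. The value 600 below is an illustrative assumption, not a recommendation, and should be chosen to match how long your nodes are typically unreachable.

```
# slurm.conf (fragment)
# Seconds the controller waits for slurmd to respond before
# setting the node's state to DOWN. The default is 300.
# 600 is a hypothetical example value.
SlurmdTimeout=600
```

A longer timeout also means that nodes which have really failed are detected later, so this is a trade-off.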
The JobRequeue parameter controls job requeue.

IHTH,
Ole

-- 
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark