[slurm-users] Re: Jobs canceling when nodes become unreachable – need guidance

5 May 2026


      On 5/4/2026 5:33 PM, Pharthiphan Asokan via slurm-users wrote:
...
We’re seeing an issue where jobs submitted via salloc are 
automatically cancelled when a compute node becomes temporarily unreachable.
Our goal is to keep jobs pending or requeued instead of being cancelled 
outright when a node drops offline briefly.
Slurm sometimes cancels the job rather than requeuing it when the node 
is marked DOWN, DRAIN, or DRAINING*.
Is there a recommended configuration or additional parameter that 
ensures jobs remain pending/requeued until the node returns, rather than 
being cancelled?
Nodes are very likely being marked DOWN (rightly so, IMHO) because of the 
SlurmdTimeout in slurm.conf:
...
The interval, in seconds, that the Slurm controller waits for slurmd to respond before configuring that node's state to DOWN. 
See https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout
The JobRequeue parameter in slurm.conf controls whether batch jobs may be 
requeued by default.
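
As a rough sketch, the two parameters above might appear in slurm.conf like 
this. The values are illustrative assumptions, not recommendations for your 
site. A longer SlurmdTimeout tolerates brief outages but also delays the 
detection of nodes that are really dead. JobRequeue is documented for batch 
jobs, so it may not change what happens to salloc allocations.

```
# slurm.conf (fragment; values are illustrative assumptions)

# Seconds slurmctld waits for slurmd to respond before setting the node DOWN.
# The documented default is 300.
SlurmdTimeout=600

# 1 = batch jobs may be requeued unless the user disables it (the default);
# 0 = batch jobs are not requeued unless the user requests it.
JobRequeue=1
```

The values currently in effect can be checked with 
"scontrol show config | grep -E 'SlurmdTimeout|JobRequeue'".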

IHTH,
Ole

-- 
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark