I would suggest making very sure that all compute nodes are properly time-synced. Then look at logs from the Slurm controller and a compute node side by side in two windows. Why are these nodes not in contact with the Slurm controller? On Tue, May 5, 2026, 7:35 PM Ole Holm Nielsen via slurm-users <slurm-users@lists.schedmd.com> wrote:
On 5/4/2026 5:33 PM, Pharthiphan Asokan via slurm-users wrote:
We're seeing an issue where jobs submitted via salloc are automatically cancelled when a compute node becomes temporarily unreachable. Our goal is to keep jobs pending or requeued instead of being cancelled outright when a node drops offline briefly. Slurm sometimes cancels the job rather than requeuing it when the node is marked DOWN/DRAIN/DRAINING*.
Is there a recommended configuration or additional parameter that ensures jobs remain pending/requeued until the node returns, rather than being cancelled?
DOWN nodes are very likely caused (rightly so, IMHO) by the SlurmdTimeout setting in slurm.conf:
The interval, in seconds, that the Slurm controller waits for slurmd to respond before configuring that node's state to DOWN. See https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout
The JobRequeue parameter in slurm.conf controls whether jobs may be requeued by default.
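For reference, a minimal slurm.conf sketch combining the two settings mentioned above (the values are illustrative, not recommendations for any particular site):

```
# slurm.conf (illustrative values, tune for your site)

# Seconds the controller waits for slurmd to respond before
# setting the node's state to DOWN (default is 300).
SlurmdTimeout=600

# Permit jobs to be requeued by default (1 = yes, 0 = no).
JobRequeue=1
```

Note that, as far as I know, requeueing applies to batch jobs; an interactive allocation created with salloc holds its resources in real time and cannot simply be requeued, so it is cancelled when its node fails.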
IHTH, Ole
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com