I would suggest making very sure that all compute nodes are properly time-synced.
Then look at the logs from the Slurm controller and a compute node side by side in two windows.


Why are these nodes not in contact with the Slurm controller?

On Tue, May 5, 2026, 7:35 PM Ole Holm Nielsen via slurm-users <slurm-users@lists.schedmd.com> wrote:
On 5/4/2026 5:33 PM, Pharthiphan Asokan via slurm-users wrote:
> We’re seeing an issue where jobs submitted via |salloc| are
> automatically cancelled when a compute node becomes temporarily unreachable.
> Our goal is to keep jobs pending or requeued instead of being cancelled
> outright when a node drops offline briefly.
> Slurm sometimes cancels the job rather than requeuing it when the node
> is marked |DOWN/DRAIN/DRAINING*|.
>
> Is there a recommended configuration or additional parameter that
> ensures jobs remain pending/requeued until the node returns, rather than
> being cancelled?

DOWN nodes are most likely caused (rightly so, IMHO) by the
SlurmdTimeout parameter in slurm.conf:
> The interval, in seconds, that the Slurm controller waits for slurmd to respond before configuring that node's state to DOWN.
See https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout

The JobRequeue parameter in slurm.conf controls whether batch jobs may be
requeued by default instead of being cancelled.
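A minimal slurm.conf sketch combining the two parameters (the values below are
illustrative examples, not recommendations; tune them to your site):

```
# slurm.conf (fragment) -- illustrative values only
#
# Seconds the controller waits for slurmd to respond before marking
# the node DOWN (default 300; a larger value tolerates longer blips)
SlurmdTimeout=600
#
# Requeue batch jobs by default instead of cancelling them on node
# failure (users can still override per job with --no-requeue)
JobRequeue=1
```

Note that JobRequeue governs batch (sbatch) jobs; salloc allocations are
interactive, so check the salloc and slurm.conf documentation for how
requeue policy applies to them.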

IHTH,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com