On 5/4/2026 5:33 PM, Pharthiphan Asokan via slurm-users wrote:
> We’re seeing an issue where jobs submitted via |salloc| are
> automatically cancelled when a compute node becomes temporarily unreachable.
> Our goal is to keep jobs pending or requeued instead of being cancelled
> outright when a node drops offline briefly.
> Slurm sometimes cancels the job rather than requeuing it when the node
> is marked |DOWN|, |DRAIN|, or |DRAINING*|.
>
> Is there a recommended configuration or additional parameter that
> ensures jobs remain pending/requeued until the node returns, rather than
> being cancelled?
Your nodes are very likely being marked DOWN (rightly so, IMHO) because of
the SlurmdTimeout parameter in slurm.conf:
> The interval, in seconds, that the Slurm controller waits for slurmd to respond before configuring that node's state to DOWN.
See https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout
The JobRequeue parameter in slurm.conf controls whether batch jobs are
requeued by default, for example after a node failure.
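
As a minimal sketch, the two parameters would appear in slurm.conf roughly
as below. The SlurmdTimeout value is only an illustrative assumption, not a
recommendation from this thread; pick one that matches how long your nodes
are typically unreachable.

```
# slurm.conf fragment (illustrative values only)

# Seconds slurmctld waits for slurmd to respond before setting the node DOWN.
# The documented default is 300; 600 here is a hypothetical example.
SlurmdTimeout=600

# 1 = batch jobs may be requeued unless the user disables it (the default),
# 0 = batch jobs are not requeued unless the user enables it.
JobRequeue=1
```

Individual batch jobs can override the default with sbatch --requeue or
sbatch --no-requeue. The running values can be checked with
"scontrol show config".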
IHTH,
Ole
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark