I would suggest making very sure that all compute nodes are properly time-synced. Then look at logs from the Slurm controller and a compute node side by side in two windows. Why are these nodes not in contact with the Slurm controller? On Tue, May 5, 2026, 7:35 PM Ole Holm Nielsen via slurm-users <slurm-users@lists.schedmd.com> wrote:
On 5/4/2026 5:33 PM, Pharthiphan Asokan via slurm-users wrote:
We're seeing an issue where jobs submitted via salloc are automatically cancelled when a compute node becomes temporarily unreachable. Our goal is to keep jobs pending or requeued instead of being cancelled outright when a node drops offline briefly. Slurm sometimes cancels the job rather than requeuing it when the node is marked DOWN/DRAIN/DRAINING*.
Is there a recommended configuration or additional parameter that ensures jobs remain pending/requeued until the node returns, rather than being cancelled?
DOWN nodes are very likely caused (rightly so, IMHO) by the SlurmdTimeout setting in slurm.conf:
The interval, in seconds, that the Slurm controller waits for slurmd to respond before configuring that node's state to DOWN. See https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout
The JobRequeue parameter in slurm.conf controls whether jobs may be requeued by default.
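For reference, a minimal slurm.conf sketch combining the two settings mentioned above (the values are illustrative, not recommendations for any particular site):

```
# slurm.conf (illustrative values, tune for your site)

# Seconds the controller waits for slurmd to respond before
# setting the node's state to DOWN (default is 300).
SlurmdTimeout=600

# Permit jobs to be requeued by default (1 = yes, 0 = no).
JobRequeue=1
```

Note that, as far as I know, requeueing applies to batch jobs; an interactive allocation created with salloc holds its resources in real time and cannot simply be requeued, so it is cancelled when its node fails.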
IHTH, Ole
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com