Hi,
We’re seeing an issue where jobs submitted via salloc are automatically cancelled when a compute node becomes temporarily unreachable.
Our goal is for these jobs to stay pending or be requeued, rather than cancelled outright, when a node drops offline briefly.
However, Slurm sometimes cancels the job instead of requeuing it when the node is marked DOWN/DRAIN/DRAINING*.

Is there a recommended configuration or additional parameter that ensures jobs remain pending/requeued until the node returns, rather than being cancelled?

Any insights or examples from similar setups would be greatly appreciated.

Regards,
Pharthiphan