Hi,

We’re seeing an issue where jobs submitted via salloc are automatically cancelled when a compute node becomes temporarily unreachable.

Our goal is for jobs to stay pending or be requeued, rather than being cancelled outright, when a node drops offline briefly.

Slurm sometimes cancels the job rather than requeuing it when the node is marked DOWN, DRAIN, or DRAINING*.

Is there a recommended configuration or additional parameter that ensures jobs remain pending/requeued until the node returns, rather than being cancelled?

Any insights or examples from similar setups would be greatly appreciated.

Regards,

Pharthiphan