19 Feb
2024
3:36 p.m.
Dear slurm-user list,

I have had cases where our ResumeProgram failed due to temporary cloud timeouts. In those cases the ResumeProgram exits with a non-zero return value. Why does Slurm still wait until ResumeTimeout expires instead of treating the startup as failed right away, which should then lead to the job being rescheduled?

Is there some way to achieve this, i.e. to tell Slurm "You can stop waiting, the node won't come alive", or am I missing the correct way to handle this in Slurm?

Best regards,
Xaver
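
For illustration, one possible way to get the "stop waiting" effect is to have the ResumeProgram itself mark a node DOWN via `scontrol update` as soon as the cloud call fails, rather than relying on its exit code. The following is only a minimal sketch under assumptions: `start_cloud_node` is a hypothetical placeholder for the site-specific cloud call, the function names are made up for this example, and whether setting the node DOWN leads to the job being requeued in your Slurm version should be checked against the power saving documentation.

```python
import subprocess
import sys


def down_command(node, reason):
    """Build the scontrol call that marks a single node DOWN with a reason."""
    return ["scontrol", "update", f"nodename={node}", "state=down", f"reason={reason}"]


def expand_hostlist(expr):
    """Expand a hostlist expression such as 'node[1-3]' into individual names."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", expr],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def start_cloud_node(node):
    """Hypothetical placeholder: replace with the real cloud provider call."""
    raise NotImplementedError("site-specific cloud startup goes here")


def resume_nodes(nodes, start_node, run=subprocess.run):
    """Try to start each node; mark the ones that fail as DOWN right away.

    Returns the list of nodes whose startup failed.
    """
    failed = []
    for node in nodes:
        try:
            start_node(node)
        except Exception as exc:
            # Assumption: an explicit DOWN state makes Slurm give up on this
            # node now instead of waiting for ResumeTimeout to expire.
            run(down_command(node, f"resume failed: {exc}"), check=False)
            failed.append(node)
    return failed


# Slurm passes the nodes to resume as one hostlist expression in argv[1].
if __name__ == "__main__" and len(sys.argv) > 1:
    failed_nodes = resume_nodes(expand_hostlist(sys.argv[1]), start_cloud_node)
    sys.exit(1 if failed_nodes else 0)
```

The `run` parameter exists only so the logic can be exercised without a live cluster; in production the default `subprocess.run` is used.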