Slurm Power Saving Guide: Why doesn't Slurm mark the node as failed when ResumeProgram returns a non-zero exit code?
19 Feb 2024, 3:36 p.m.
Dear slurm-user list,

I have had cases where our ResumeProgram failed because of temporary cloud timeouts. In those cases the ResumeProgram returns a non-zero exit code. Why does Slurm still wait until ResumeTimeout expires instead of treating the startup as failed, which should then lead to the job being rescheduled?

Is there some way to achieve this, i.e. to tell Slurm: "You can stop waiting, the node won't come alive"? Or am I missing the correct way to handle this in Slurm?

Best regards,
Xaver
Xaver Stiensmeier
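
To make the question concrete, here is a minimal sketch of one possible workaround: the ResumeProgram itself marks the node DOWN through `scontrol update` when the cloud call fails, rather than relying on its exit code. This is an assumption, not confirmed Slurm behaviour; whether it ends the wait before ResumeTimeout and reschedules the job depends on the Slurm version and configuration. `start_cloud_instances` is a hypothetical placeholder for the site-specific cloud call, and `resume` and `build_mark_down_command` are illustrative names.

```python
#!/usr/bin/env python3
"""Sketch of a ResumeProgram that reports a failed startup to Slurm."""
import subprocess
import sys


def build_mark_down_command(nodes, reason):
    """Build the scontrol call that sets the given nodes to DOWN."""
    return [
        "scontrol", "update",
        f"NodeName={nodes}",
        "State=DOWN",
        f"Reason={reason}",
    ]


def start_cloud_instances(nodes):
    """Hypothetical placeholder for the site-specific cloud API call."""
    raise TimeoutError("cloud API timed out")


def resume(nodes, start=start_cloud_instances, run=subprocess.run):
    """Try to start the nodes; on failure, mark them DOWN and return 1."""
    try:
        start(nodes)
    except Exception as exc:
        # Assumption: setting the node DOWN makes Slurm give up on it
        # without waiting for ResumeTimeout to expire.
        run(build_mark_down_command(nodes, f"resume failed: {exc}"), check=False)
        return 1
    return 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Slurm passes the nodes to resume as a hostlist expression in argv[1].
    sys.exit(resume(sys.argv[1]))
```

The `start` and `run` parameters exist only so the failure path can be exercised without a cloud backend or a running Slurm controller.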