Dear Slurm-user list,
when a job fails because the node startup fails (cloud scheduling), the job should be requeued according to the slurm.conf documentation for ResumeTimeout:

> Maximum time permitted (in seconds) between when a node resume request is issued and when the node is actually available for use. Nodes which fail to respond in this time frame will be marked DOWN and the jobs scheduled on the node requeued.
However, instead of being requeued, the job is killed:
[2024-11-18T10:41:52.003] node bibigrid-worker-wubqboa1z2kkgx0-0 not resumed by ResumeTimeout(1200) - marking down and power_save
[2024-11-18T10:41:52.003] Killing JobId=1 on failed node bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.046] update_node: node bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:52.046] power down request repeating for node bibigrid-worker-wubqboa1z2kkgx0-0
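One thing I have not fully ruled out is whether requeueing is permitted for this job at all (as far as I know, only batch jobs can be requeued). A quick sketch of how one could check, assuming JobId=1 as above:

```bash
# Cluster-wide default: JobRequeue=1 means batch jobs may be requeued
scontrol show config | grep -i '^JobRequeue'

# Per-job settings: Requeue=1 unless submitted with --no-requeue;
# BatchFlag=0 would mean an interactive allocation, which cannot be requeued
scontrol show job 1 | grep -oE '(Requeue|BatchFlag)=[0-9]+'
```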
Our ResumeProgram does not change the state of the underlying workers. I think we should set the nodes to DOWN explicitly if the startup fails, given this from the ResumeProgram documentation:

> If the ResumeProgram is unable to restore a node to service with a responding slurmd and an updated BootTime, it should set the node state to DOWN, which will result in a requeue of any job associated with the node - this will happen automatically if the node doesn't register within ResumeTimeout.
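A minimal sketch of what that could look like in the failure path of our ResumeProgram (or in the ResumeFailProgram), assuming, as the slurm.conf man page describes, that the affected nodes are passed as a hostlist expression in the first argument; the script itself is hypothetical:

```bash
#!/bin/bash
# Hypothetical failure path of our ResumeProgram:
# $1 holds the nodes that could not be started, in hostlist format.
failed_nodes="$1"

# Mark the nodes DOWN explicitly so slurmctld requeues their jobs
# right away instead of waiting for ResumeTimeout to expire.
scontrol update NodeName="$failed_nodes" State=DOWN Reason="FailedStartup"
```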
But in any case, as the log shows, the job should be requeued simply because ResumeTimeout was reached, and I am unsure why that is not happening. The power down request is issued by our ResumeFailProgram. We have SlurmctldParameters=idle_on_node_suspend set, but as far as I can tell that should not affect resume behavior.
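For reference, the relevant power-saving configuration looks roughly like this; ResumeTimeout and SlurmctldParameters are as stated above, the paths are placeholders:

```
ResumeProgram=/path/to/resume.sh
ResumeFailProgram=/path/to/resume_fail.sh
ResumeTimeout=1200
SlurmctldParameters=idle_on_node_suspend
```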
My Slurm version is 23.11.5.
Best regards,
Xaver
# More context
## Slurmctld log from job submission to failure
[2024-11-18T10:21:45.490] sched: _slurm_rpc_allocate_resources JobId=1 NodeList=bibigrid-worker-wubqboa1z2kkgx0-0 usec=1221
[2024-11-18T10:21:45.499] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:21:58.387] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:21:58.387] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:22:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:23:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:23:20.009] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:23:23.003] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:23:23.398] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:23:23.398] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:23:53.398] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:23:53.398] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:24:21.000] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:24:21.484] slurmscriptd: error: _run_script: JobId=0 resumeprog exit status 1:0
[2024-11-18T10:25:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:26:02.000] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:26:02.417] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:26:02.417] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:26:20.007] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:26:32.417] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:26:32.417] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:27:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:28:20.003] debug: Updating partition uid access list
[2024-11-18T10:28:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:28:20.008] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:29:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:29:22.000] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:29:22.448] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:29:22.448] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:30:20.007] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:31:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:32:21.000] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:32:42.000] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:32:42.478] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:32:42.478] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:33:12.479] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:33:12.479] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:33:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:33:20.010] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:34:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:35:20.007] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:36:01.004] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:36:01.504] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:36:01.504] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:36:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:36:31.505] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:36:31.505] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:37:21.000] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:38:20.008] debug: Updating partition uid access list
[2024-11-18T10:38:20.008] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:38:20.017] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:39:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:39:21.003] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:39:21.530] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:39:21.530] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:39:51.531] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:39:51.531] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:40:21.000] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:41:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:41:52.003] node bibigrid-worker-wubqboa1z2kkgx0-0 not resumed by ResumeTimeout(1200) - marking down and power_save
[2024-11-18T10:41:52.003] Killing JobId=1 on failed node bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.046] update_node: node bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:52.046] power down request repeating for node bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.047] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:41:52.549] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:41:52.549] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:41:52.736] _slurm_rpc_complete_job_allocation: JobId=1 error Job/step already completing or completed
[2024-11-18T10:41:53.000] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:41:53.000] debug: sched: Running job scheduler for default depth.
[2024-11-18T10:41:53.014] update_node: node bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:53.014] update_node: node bibigrid-worker-wubqboa1z2kkgx0-0 state set to IDLE