Dear Slurm-user list,
when a job fails because the node startup fails (cloud scheduling), the job should be requeued according to the slurm.conf documentation for ResumeTimeout:

> Maximum time permitted (in seconds) between when a node resume request is issued and when the node is actually available for use. Nodes which fail to respond in this time frame will be marked DOWN and the jobs scheduled on the node requeued.
However, instead of being requeued, the job is killed:
[2024-11-18T10:41:52.003] node bibigrid-worker-wubqboa1z2kkgx0-0 not resumed by ResumeTimeout(1200) - marking down and power_save
[2024-11-18T10:41:52.003] Killing JobId=1 on failed node bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.046] update_node: node bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:52.046] power down request repeating for node bibigrid-worker-wubqboa1z2kkgx0-0
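One thing I have not fully ruled out is whether requeueing is permitted for this job at all (as far as I know, only batch jobs can be requeued). A quick sketch of how one could check, assuming JobId=1 as above:

```bash
# Cluster-wide default: JobRequeue=1 means batch jobs may be requeued
scontrol show config | grep -i '^JobRequeue'

# Per-job settings: Requeue=1 unless submitted with --no-requeue;
# BatchFlag=0 would mean an interactive allocation, which cannot be requeued
scontrol show job 1 | grep -oE '(Requeue|BatchFlag)=[0-9]+'
```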
Our ResumeProgram does not change the state of the underlying workers. I think we should set the nodes to DOWN explicitly if the startup fails, given this from the ResumeProgram documentation:

> If the ResumeProgram is unable to restore a node to service with a responding slurmd and an updated BootTime, it should set the node state to DOWN, which will result in a requeue of any job associated with the node - this will happen automatically if the node doesn't register within ResumeTimeout.
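A minimal sketch of what that could look like in the failure path of our ResumeProgram (or in the ResumeFailProgram), assuming, as the slurm.conf man page describes, that the affected nodes are passed as a hostlist expression in the first argument; the script itself is hypothetical:

```bash
#!/bin/bash
# Hypothetical failure path of our ResumeProgram:
# $1 holds the nodes that could not be started, in hostlist format.
failed_nodes="$1"

# Mark the nodes DOWN explicitly so slurmctld requeues their jobs
# right away instead of waiting for ResumeTimeout to expire.
scontrol update NodeName="$failed_nodes" State=DOWN Reason="FailedStartup"
```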
But in any case, as the log shows, the job should be requeued simply because ResumeTimeout was reached, and I am unsure why that is not happening. The power down request is issued by our ResumeFailProgram. We have SlurmctldParameters=idle_on_node_suspend set, but as far as I can tell that should not affect resume behavior.
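For reference, the relevant power-saving configuration looks roughly like this; ResumeTimeout and SlurmctldParameters are as stated above, the paths are placeholders:

```
ResumeProgram=/path/to/resume.sh
ResumeFailProgram=/path/to/resume_fail.sh
ResumeTimeout=1200
SlurmctldParameters=idle_on_node_suspend
```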
My Slurm version is 23.11.5.
Best regards,
Xaver
# More context
## Slurmctld log from job submission to failure
[2024-11-18T10:21:45.490] sched: _slurm_rpc_allocate_resources JobId=1 NodeList=bibigrid-worker-wubqboa1z2kkgx0-0 usec=1221
[2024-11-18T10:21:45.499] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:21:58.387] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:21:58.387] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:22:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:23:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:23:20.009] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:23:23.003] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:23:23.398] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:23:23.398] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:23:53.398] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:23:53.398] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:24:21.000] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:24:21.484] slurmscriptd: error: _run_script: JobId=0 resumeprog exit status 1:0
[2024-11-18T10:25:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:26:02.000] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:26:02.417] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:26:02.417] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:26:20.007] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:26:32.417] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:26:32.417] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:27:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:28:20.003] debug: Updating partition uid access list
[2024-11-18T10:28:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:28:20.008] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:29:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:29:22.000] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:29:22.448] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:29:22.448] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:30:20.007] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:31:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:32:21.000] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:32:42.000] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:32:42.478] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:32:42.478] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:33:12.479] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:33:12.479] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:33:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:33:20.010] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:34:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:35:20.007] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:36:01.004] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:36:01.504] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:36:01.504] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:36:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:36:31.505] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:36:31.505] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:37:21.000] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:38:20.008] debug: Updating partition uid access list
[2024-11-18T10:38:20.008] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:38:20.017] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:39:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:39:21.003] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:39:21.530] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:39:21.530] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:39:51.531] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:39:51.531] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:40:21.000] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:41:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:41:52.003] node bibigrid-worker-wubqboa1z2kkgx0-0 not resumed by ResumeTimeout(1200) - marking down and power_save
[2024-11-18T10:41:52.003] Killing JobId=1 on failed node bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.046] update_node: node bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:52.046] power down request repeating for node bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.047] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:41:52.549] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:41:52.549] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:41:52.736] _slurm_rpc_complete_job_allocation: JobId=1 error Job/step already completing or completed
[2024-11-18T10:41:53.000] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:41:53.000] debug: sched: Running job scheduler for default depth.
[2024-11-18T10:41:53.014] update_node: node bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:53.014] update_node: node bibigrid-worker-wubqboa1z2kkgx0-0 state set to IDLE