Dear Slurm-user list,
when node startup fails during cloud scheduling, the job scheduled on that node should be re-queued, according to the slurm.conf documentation for ResumeTimeout:
> *ResumeTimeout*: Maximum time permitted (in seconds) between when a node resume request is issued and when the node is actually available for use. Nodes which fail to respond in this time frame will be marked DOWN and the jobs scheduled on the node requeued.
However, instead of being requeued, the job is killed:
```
[2024-11-18T10:41:52.003] node bibigrid-worker-wubqboa1z2kkgx0-0 not resumed by ResumeTimeout(1200) - marking down and power_save
[2024-11-18T10:41:52.003] Killing JobId=1 on failed node bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.046] update_node: node bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:52.046] power down request repeating for node bibigrid-worker-wubqboa1z2kkgx0-0
```
Our ResumeProgram does not change the state of the underlying workers. I think we should set the nodes to DOWN explicitly when the startup fails, given this part of the documentation:

> If *ResumeProgram* is unable to restore a node to service with a responding slurmd and an updated BootTime, it should set the node state to DOWN, which will result in a requeue of any job associated with the node - this will happen automatically if the node doesn't register within ResumeTimeout.
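A minimal sketch of the failure path I have in mind for our ResumeProgram follows. `start_cloud_node` is a hypothetical stand-in for our actual provisioning call, and the sketch assumes the hostlist expression slurmctld passes has already been expanded to one node name per argument; `scontrol update ... State=DOWN` is the documented way to set the state:

```python
#!/usr/bin/env python3
"""Sketch of a ResumeProgram failure path: mark nodes DOWN explicitly
when the cloud startup fails, so jobs are requeued right away."""
import subprocess
import sys


def start_cloud_node(node: str) -> bool:
    """Hypothetical helper: boot the cloud instance backing `node`.
    Returns False when the instance fails to start."""
    raise NotImplementedError


def main() -> int:
    failed = []
    for node in sys.argv[1:]:
        try:
            ok = start_cloud_node(node)
        except Exception:
            ok = False
        if not ok:
            failed.append(node)
    if failed:
        # Mark the failed nodes DOWN ourselves so slurmctld requeues their
        # jobs immediately instead of waiting for ResumeTimeout to expire.
        subprocess.run(
            ["scontrol", "update",
             f"NodeName={','.join(failed)}",
             "State=DOWN", "Reason=FailedStartup"],
            check=True,
        )
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```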
But in any case, as the log shows, the job should have been requeued simply because the node hit ResumeTimeout; I am unsure why that is not happening. The power down request is sent by our ResumeFailProgram. We have SlurmctldParameters=idle_on_node_suspend set, but I assume that shouldn't affect resume.
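For reference, the relevant power-saving settings in our slurm.conf look roughly like this (the program paths are placeholders):

```
ResumeProgram=/opt/slurm/resume.py          # placeholder path
ResumeFailProgram=/opt/slurm/resume_fail.py # placeholder path
ResumeTimeout=1200
SlurmctldParameters=idle_on_node_suspend
```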
My Slurm version is 23.11.5.
Best regards,
Xaver
# More context
## Slurmctld log from job submission to failure
```
[2024-11-18T10:21:45.490] sched: _slurm_rpc_allocate_resources JobId=1 NodeList=bibigrid-worker-wubqboa1z2kkgx0-0 usec=1221
[2024-11-18T10:21:45.499] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:21:58.387] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:21:58.387] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:22:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:23:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:23:20.009] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:23:23.003] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:23:23.398] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:23:23.398] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:23:53.398] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:23:53.398] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:24:21.000] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:24:21.484] slurmscriptd: error: _run_script: JobId=0 resumeprog exit status 1:0
[2024-11-18T10:25:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:26:02.000] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:26:02.417] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:26:02.417] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:26:20.007] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:26:32.417] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:26:32.417] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:27:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:28:20.003] debug: Updating partition uid access list
[2024-11-18T10:28:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:28:20.008] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:29:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:29:22.000] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:29:22.448] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:29:22.448] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:30:20.007] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:31:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:32:21.000] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:32:42.000] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:32:42.478] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:32:42.478] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:33:12.479] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:33:12.479] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:33:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:33:20.010] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:34:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:35:20.007] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:36:01.004] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:36:01.504] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:36:01.504] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:36:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:36:31.505] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:36:31.505] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:37:21.000] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:38:20.008] debug: Updating partition uid access list
[2024-11-18T10:38:20.008] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:38:20.017] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:39:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:39:21.003] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:39:21.530] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:39:21.530] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:39:51.531] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:39:51.531] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:40:21.000] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:41:20.003] debug: sched: Running job scheduler for full queue.
[2024-11-18T10:41:52.003] node bibigrid-worker-wubqboa1z2kkgx0-0 not resumed by ResumeTimeout(1200) - marking down and power_save
[2024-11-18T10:41:52.003] Killing JobId=1 on failed node bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.046] update_node: node bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:52.046] power down request repeating for node bibigrid-worker-wubqboa1z2kkgx0-0
[2024-11-18T10:41:52.047] debug: sackd_mgr_dump_state: saved state of 0 nodes
[2024-11-18T10:41:52.549] debug: sched/backfill: _attempt_backfill: beginning
[2024-11-18T10:41:52.549] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2024-11-18T10:41:52.736] _slurm_rpc_complete_job_allocation: JobId=1 error Job/step already completing or completed
[2024-11-18T10:41:53.000] debug: Spawning ping agent for bibigrid-master-wubqboa1z2kkgx0
[2024-11-18T10:41:53.000] debug: sched: Running job scheduler for default depth.
[2024-11-18T10:41:53.014] update_node: node bibigrid-worker-wubqboa1z2kkgx0-0 reason set to: FailedStartup
[2024-11-18T10:41:53.014] update_node: node bibigrid-worker-wubqboa1z2kkgx0-0 state set to IDLE
```