[slurm-users] Slurm forgetting about job dependencies

Eli V eliventer at gmail.com
Wed Dec 19 15:54:49 MST 2018


Looking through the slurm.conf docs and grepping around the source code,
it looks like MinJobAge might be what I need to adjust. I changed it
by 3 orders of magnitude, 300 -> 300_000, on our dev cluster. I'll see
how things go.
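
A quick sanity check of that theory, as a rough sketch only: it takes the
timestamps from the slurmctld.log excerpt quoted below and compares the
age of the completed predecessor with the old and new MinJobAge values.
It assumes MinJobAge is measured in seconds and that purging of the
completed job record is what breaks the dependency, which is not
confirmed here. The helper names (seconds_between, outlived) are made up
for illustration and are not part of Slurm.

```python
from datetime import datetime

# Timestamp layout used in the slurmctld.log lines quoted below.
FMT = "%Y-%m-%dT%H:%M:%S.%f"


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two slurmctld-style timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


def outlived(age_seconds: float, min_job_age: int) -> bool:
    """True if a completed job is older than MinJobAge (assumed seconds)."""
    return age_seconds > min_job_age


# _job_complete ... done for JobId=3522091_11
completed = "2018-12-19T09:21:22.914"

# Later events for the dependent task JobId=3522092_11
events = {
    "split out": "2018-12-19T09:30:07.922",
    "cancelled": "2018-12-19T10:38:12.981",
}

for label, stamp in events.items():
    age = seconds_between(completed, stamp)
    for min_job_age in (300, 300_000):
        print(f"{label}: predecessor done {age:.0f}s earlier, "
              f"older than MinJobAge={min_job_age}? "
              f"{outlived(age, min_job_age)}")
```

By both events the predecessor had been finished for longer than 300
seconds but far less than 300_000, so the timeline is at least
consistent with the record having aged out under the old setting.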

On Wed, Dec 19, 2018 at 1:14 PM Eli V <eliventer at gmail.com> wrote:
>
> Does Slurm remove job completion info from its memory after a while?
> That might explain why I'm seeing jobs getting canceled when the
> predecessor step they depend on finished OK. Below is the output of
> egrep '352209(1|2)_11' on slurmctld.log. The 3522092 job array was
> created with -d aftercorr:3522091. It looks like the predecessor task
> finished successfully at 9:21; the dependent task was split out to run
> at 9:30, never ran, and was then canceled at 10:38 because of an
> unsatisfied dependency on the job that had already completed over an
> hour earlier. Is there some config in slurm.conf that will keep this
> completion info around longer, or is this just a flat-out bug in
> slurmctld?
>
> [2018-12-19T08:48:52.632] backfill: Started JobId=3522091_11(3522113)
> in low on r3-19
> [2018-12-19T09:21:22.914] _job_complete: JobId=3522091_11(3522113) WEXITSTATUS 0
> [2018-12-19T09:21:22.914] _job_complete: JobId=3522091_11(3522113) done
> [2018-12-19T09:30:07.922] build_job_queue: Split out
> JobId=3522092_11(3522317) for SLURM_DEPEND_AFTER_CORRESPOND use
> [2018-12-19T10:38:12.981] _kill_dependent: Job dependency can't be
> satisfied, cancelling JobId=3522092_11(3522317)


