[slurm-users] Slurm forgetting about job dependencies

Eli V eliventer at gmail.com
Wed Dec 19 11:14:07 MST 2018


Does Slurm remove job completion info from its memory after a while?
That might explain why I'm seeing jobs get cancelled even though their
predecessor step finished OK. Below is the output of egrep
'352209(1|2)_11' on slurmctld.log. The 3522092 job array was created
with -d aftercorr:3522091. It looks like the predecessor task finished
successfully at 9:21; the dependent task was split out to run at 9:30,
never ran, and was then cancelled at 10:38 because of an unsatisfied
dependency on a job that had already completed over an hour earlier.
Is there some setting in slurm.conf that will keep this completion info
around longer, or is this just a flat-out bug in slurmctld?

[2018-12-19T08:48:52.632] backfill: Started JobId=3522091_11(3522113) in low on r3-19
[2018-12-19T09:21:22.914] _job_complete: JobId=3522091_11(3522113) WEXITSTATUS 0
[2018-12-19T09:21:22.914] _job_complete: JobId=3522091_11(3522113) done
[2018-12-19T09:30:07.922] build_job_queue: Split out JobId=3522092_11(3522317) for SLURM_DEPEND_AFTER_CORRESPOND use
[2018-12-19T10:38:12.981] _kill_dependent: Job dependency can't be satisfied, cancelling JobId=3522092_11(3522317)
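
For anyone who wants to check the timeline, here is a minimal Python sketch that pulls the timestamps out of the log excerpt above and computes how long after the predecessor's completion the dependent task was split out and then cancelled. The helper names (parse, first_time) are purely illustrative and not part of Slurm; the log text is copied from the excerpt with the wrapped lines rejoined.

```python
from datetime import datetime

# Log lines copied from the slurmctld.log excerpt (wrapped lines rejoined).
LOG = """\
[2018-12-19T08:48:52.632] backfill: Started JobId=3522091_11(3522113) in low on r3-19
[2018-12-19T09:21:22.914] _job_complete: JobId=3522091_11(3522113) WEXITSTATUS 0
[2018-12-19T09:21:22.914] _job_complete: JobId=3522091_11(3522113) done
[2018-12-19T09:30:07.922] build_job_queue: Split out JobId=3522092_11(3522317) for SLURM_DEPEND_AFTER_CORRESPOND use
[2018-12-19T10:38:12.981] _kill_dependent: Job dependency can't be satisfied, cancelling JobId=3522092_11(3522317)
"""


def parse(line):
    """Return (timestamp, message) for one slurmctld log line (hypothetical helper)."""
    stamp, _, message = line.partition("] ")
    return datetime.strptime(stamp.lstrip("["), "%Y-%m-%dT%H:%M:%S.%f"), message


events = [parse(line) for line in LOG.splitlines() if line.startswith("[")]


def first_time(keyword):
    """Timestamp of the first log message containing keyword (hypothetical helper)."""
    return next(t for t, msg in events if keyword in msg)


completed = first_time("WEXITSTATUS")      # predecessor task 3522091_11 exits with status 0
split_out = first_time("Split out")        # dependent task 3522092_11 split out of the array
cancelled = first_time("_kill_dependent")  # dependent task cancelled

gap_to_split = split_out - completed
gap_to_cancel = cancelled - completed
print("completion -> split out:", gap_to_split)
print("completion -> cancel:   ", gap_to_cancel)
```

The second gap comes out at well over an hour, which matches the description above: the cancellation happened long after the predecessor had already reported a clean exit.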


