The behavior of Slurm jobs using the 'afterok' dependency seems to have issues with ephemeral compute nodes such as those in a cloud cluster. If the job ID named in the dependency ran on a compute node that has already spun down, then a subsequent job that requires successful completion of that prior job fails with a "Job dependency problem". This occurs when the subsequent job is tied to a node that must spin up before it can begin execution. The phenomenon does not occur if the 'afterany' dependency is used. It seems that job completion status is retained when a node is spun down, but no information about whether the job executed successfully is saved. There are perhaps other scenarios that could cause the same issue. Has anyone else witnessed this problem? How can it be avoided?
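For context, here is a minimal sketch of the submission pattern being described, assuming two hypothetical batch scripts step1.sh and step2.sh (the scripts themselves don't matter, only the dependency):

    # Submit the first job and capture its job ID (--parsable prints just the ID)
    jobid=$(sbatch --parsable step1.sh)

    # Dependent job: this is the case that reportedly fails with
    # "Job dependency problem" once the node that ran $jobid has spun down
    # and the dependent job has to wait for a node to spin up
    sbatch --dependency=afterok:${jobid} step2.sh

    # The same pattern with afterany reportedly does not show the problem
    sbatch --dependency=afterany:${jobid} step2.sh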
For reference, the dependency types from the sbatch documentation:

afterany: This job can begin execution after the specified jobs have terminated.
aftercorr: A task of this job array can begin execution after the corresponding task ID in the specified job has completed successfully.
afternotok: This job can begin execution after the specified jobs have terminated in some failed state.
afterok: This job can begin execution after the specified jobs have successfully executed.
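One way to probe what, if anything, survives the spin-down is to compare what the controller and the accounting database still report for the finished job. A hedged sketch, assuming a hypothetical job ID 12345 and that slurmdbd accounting is configured:

    # The controller only knows about jobs still in its recent-job window
    scontrol show job 12345 | grep -E 'JobState|ExitCode'

    # If accounting storage is enabled, sacct reports the recorded state and
    # exit code even after the node that ran the job has gone away
    sacct -j 12345 --format=JobID,State,ExitCode,NodeList

If sacct shows the correct State and ExitCode but the afterok dependency still fails, that would point at the dependency check itself rather than lost accounting data.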