[slurm-users] Dependencies with singleton and after

Kevin Buckley Kevin.Buckley at pawsey.org.au
Thu Aug 22 07:33:36 UTC 2019


On 2019/08/22 04:51, Jarno van der Kolk wrote:
> Hi,
> 
> I am helping a researcher who encountered unexpected behaviour with dependencies. He uses both "singleton" and "after". A minimal working example follows:
> 
> $ sbatch --hold fakejob.sh
> Submitted batch job 25909273
> $ sbatch --hold fakejob.sh
> Submitted batch job 25909274
> $ sbatch --hold fakejob.sh
> Submitted batch job 25909275
> $ scontrol update jobid=25909273 Dependency=singleton
> $ scontrol update jobid=25909274 Dependency=singleton,after:25909275
> $ scontrol update jobid=25909275 Dependency=singleton,after:25909273
> $ scontrol release 25909273 25909274 25909275
> 
> When the jobs are released, the scheduler starts job 25909273, as expected. The other two jobs stay pending because of the singleton dependency and their shared job name, which is also expected.
> 
> However, when job 25909273 finishes, we would have expected job 25909275 to start, since the singleton is now free and job 25909274 cannot start because of its "after:25909275" dependency. That is, the expected order would be 25909273, 25909275, 25909274, one at a time.
> 
> Instead what happens is that job 25909273 starts and completes and then jobs 25909274 and 25909275 remain queued with unsatisfied dependencies.
> 
> It is entirely possible that I am thinking about this the wrong way, of course, but I don't see it. Is this expected behaviour?
> 

Given that the definition of the "singleton" dependency

    This job can begin execution after any previously launched jobs
    sharing the same job name and user have terminated.  In other
    words, only one job by that name and owned by that user can be
    running or suspended at any point in time.

contains the word "any", are you perhaps introducing a circular
dependency?


It may seem obvious that

25909274 can't start because of the explicit "after:25909275"

but perhaps

25909275 can't start (and terminate) because it's "waiting" on another
job with the same name, 25909274, to terminate, since 25909274 already
existed when the "singleton" on 25909275 was defined.


I'm not saying that this is the intended behaviour of the dependency
conditions, but it may be a not impossible interpretation.
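
To make the two readings concrete, here is a toy model in Python. It is
only a sketch and not Slurm's scheduler logic: the function name
"simulate" and the "strict" flag are made up for illustration, "after" is
assumed to be satisfied once the named job has started, and each job is
assumed to finish before the next scheduling pass.

```python
def simulate(jobs, after, strict):
    """Return the order in which jobs start under a toy singleton model.

    jobs   -- job IDs in submission order, all sharing one name and user
    after  -- maps a job ID to the set of job IDs that must have started first
    strict -- True: singleton waits until every earlier-submitted job has
              terminated (the speculative reading); False: singleton only
              requires that no same-name job is currently running
    """
    started, finished, order = set(), set(), []
    while True:
        runnable = []
        for position, job in enumerate(jobs):
            if job in started:
                continue
            if not after.get(job, set()) <= started:
                continue  # an "after" dependency has not started yet
            if strict and not set(jobs[:position]) <= finished:
                continue  # an earlier same-name job has not terminated
            runnable.append(job)
        if not runnable:
            return order  # nothing left, or the remaining jobs are deadlocked
        job = runnable[0]
        started.add(job)
        order.append(job)
        finished.add(job)  # assumption: it completes before the next pass


jobs = [25909273, 25909274, 25909275]
after = {25909274: {25909275}, 25909275: {25909273}}

print(simulate(jobs, after, strict=False))  # [25909273, 25909275, 25909274]
print(simulate(jobs, after, strict=True))   # [25909273]
```

The strict reading reproduces the reported outcome (only 25909273 runs),
but that only shows the interpretation is consistent with what was seen,
not that Slurm actually evaluates "singleton" this way.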


It might also be worth considering whether a job submitted in a "held"
state counts as "suspended", even though it was never "launched" and then
"held".


Kevin

-- 
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
Tel: +61 8 6436 8902
SMS: +61 4 9970 3915
Eml: kevin.buckley at pawsey.org.au


