[slurm-users] Dependencies with singleton and after

Jarno van der Kolk jvanderk at uottawa.ca
Thu Aug 22 13:23:37 UTC 2019


> From: Kevin Buckley <Kevin.Buckley at pawsey.org.au>
> Sent: August 22, 2019 3:33 AM
> To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
> Cc: Jarno van der Kolk <jvanderk at uottawa.ca>
> Subject: Re: [slurm-users] Dependencies with singleton and after
> 
> On 2019/08/22 04:51, Jarno van der Kolk wrote:
> > Hi,
> >
> > I am helping a researcher who encountered an unexpected behaviour with dependencies. He uses both "singleton" and "after". The minimal working example is as follows:
> >
> > $ sbatch --hold fakejob.sh
> > Submitted batch job 25909273
> > $ sbatch --hold fakejob.sh
> > Submitted batch job 25909274
> > $ sbatch --hold fakejob.sh
> > Submitted batch job 25909275
> > $ scontrol update jobid=25909273 Dependency=singleton
> > $ scontrol update jobid=25909274 Dependency=singleton,after:25909275
> > $ scontrol update jobid=25909275 Dependency=singleton,after:25909273
> > $ scontrol release 25909273 25909274 25909275
> >
> > When releasing the jobs, the scheduler will start job 25909273 which is to be expected. The other jobs will be held due to the singleton and the jobs having the same job name, also expected.
> >
> > However, when the job finishes, we would have expected job 25909275 to start since the singleton is now free and job 25909274 cannot start due to its dependency of "after:25909275". That is, the expected order would be 25909273 25909275 25909274 and one at a time.
> >
> > Instead what happens is that job 25909273 starts and completes and then jobs 25909274 and 25909275 remain queued with unsatisfied dependencies.
> >
> > It is entirely possible that I am thinking about this the wrong way, of course, but I don't see how. Is this expected behaviour?
> >
> 
> Given that the definition of the "singleton" dependency
> 
>     This job can begin execution after any previously launched jobs
>     sharing the same job name and user have terminated.  In other
>     words, only one job by that name and owned by that user can be
>     running or suspended at any point in time.
> 
> contains the word "any", are you, perhaps, introducing a circular
> dependency ?
> 
> 
> It may seem obvious that
> 
> 25909274 can't start because of the explicit "after:25909275"
> 
> but perhaps
> 
> 25909275 can't start (and terminate) because it's "waiting" on another
> job with the same name, 25909274, to terminate, because that job existed
> at the time that its "singleton" was defined.
> 
> 
> Not saying that that's the intention of the dependency conditions,
> but maybe a not impossible interpretation.
> 
> 
> It might also be worth considering if a job submitted in a "held" state
> counts as "suspended" even though it was never "launched" and then "held"?
> 

Hi Kevin,

It may very well be a circular dependency. It certainly looks like one, given that the two remaining jobs are both waiting for dependencies to be resolved.

Your interpretation makes sense. So just to make sure I got it right:
Job 25909273 finishes.
Job 25909274 is next due to singleton but cannot start due to the additional after:25909275 dependency.
Job 25909275 won't start because of the singleton dependency which is causing it to wait for 25909274 to finish.
The final state is that 25909274 waits for 25909275 and 25909275 waits for 25909274, i.e. circular.
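
To make the two readings concrete, here is a toy model in Python. It is only a sketch of the two interpretations discussed in this thread, not Slurm's actual scheduler logic. The names Job, simulate and strict_singleton are made up for illustration. It assumes that all jobs share the same name and user, that "after:X" is satisfied once X has started, and that jobs run one at a time to completion.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    after: list = field(default_factory=list)  # ids named in "after:<id>"


def simulate(jobs, strict_singleton):
    """Return (start_order, stuck_ids) for same-name singleton jobs.

    strict_singleton=True  : a job also waits for every earlier-submitted
                             (lower id) same-name job to terminate.
    strict_singleton=False : singleton only forbids two same-name jobs
                             running at the same time.
    """
    all_ids = {j.job_id for j in jobs}
    pending = {j.job_id: j for j in jobs}
    started, finished, order = set(), set(), []

    while pending:
        runnable = None
        for jid in sorted(pending):
            job = pending[jid]
            # "after" is satisfied once the named jobs have started
            after_ok = all(dep in started for dep in job.after)
            # strict reading: all earlier same-name jobs must be done
            earlier_ok = (not strict_singleton) or all(
                other in finished for other in all_ids if other < jid
            )
            if after_ok and earlier_ok:
                runnable = jid
                break
        if runnable is None:
            break  # nothing can start: the remaining jobs are stuck
        del pending[runnable]
        started.add(runnable)
        order.append(runnable)
        finished.add(runnable)  # runs to completion before the next one

    return order, sorted(pending)


jobs = [
    Job(25909273),
    Job(25909274, after=[25909275]),
    Job(25909275, after=[25909273]),
]

print(simulate(jobs, strict_singleton=True))
print(simulate(jobs, strict_singleton=False))
```

With strict_singleton=True the model starts only 25909273 and leaves 25909274 and 25909275 stuck, which matches what we observe. With strict_singleton=False it gives the order I had expected: 25909273, 25909275, 25909274.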

I guess this raises the next question that you also touch upon:
Is this a bug in Slurm, in the sense that the current behaviour is unintended, or does the documentation just need clarifying?

Thanks,

Jarno

Jarno van der Kolk, PhD Phys.
Analyste principal en informatique scientifique | Senior Scientific Computing Specialist
Solutions TI | IT Solutions
Université d’Ottawa | University of Ottawa

