[slurm-users] speed / efficiency of sacct vs. scontrol

David Laehnemann david.laehnemann at hhu.de
Thu Feb 23 13:48:55 UTC 2023


Hi Sean, hi everybody,

thanks a lot for the quick insights!

My takeaway is: sacct is the better default for issuing lots of job
status checks after all, as it will not put additional load on the
slurmctld scheduler.

Quick follow-up question: do you have any indication of the rate of job
status checks via sacct (per second) that slurmdbd will gracefully
handle? Or any suggestions on how to roughly determine such a rate for
a given cluster system?
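
One rough way to get a feel for such a rate on a given cluster might be
to time a small burst of sacct queries yourself while keeping an eye on
slurmdbd---the sustainable rate will depend on the slurmdbd
configuration and the database behind it. Here is a minimal Python
sketch along those lines (the job id and query count are just
placeholders to replace):

#!/usr/bin/env python3
# Rough sketch: time a modest burst of sacct queries to estimate how
# many job status checks per second your slurmdbd answers comfortably.
# JOB_ID and N_QUERIES are placeholders; keep N_QUERIES small and watch
# slurmdbd load while this runs.
import subprocess
import time

JOB_ID = "12345678"   # replace with a real job id on your cluster
N_QUERIES = 50        # modest burst, to avoid hammering slurmdbd

start = time.perf_counter()
for _ in range(N_QUERIES):
    subprocess.run(
        ["sacct", "-j", JOB_ID, "--format=JobID,State",
         "--noheader", "--parsable2"],
        check=True, capture_output=True,
    )
elapsed = time.perf_counter() - start
print(f"{N_QUERIES} sacct calls in {elapsed:.2f} s "
      f"(~{N_QUERIES / elapsed:.1f} queries/second)")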

cheers,
david


P.S.: @Loris and @Noam: Exactly, snakemake is a piece of software
distinct from Slurm that you can use to orchestrate large analysis
workflows---on anything from a desktop or laptop computer to all kinds
of cluster / cloud systems. In the case of Slurm, it will submit each
analysis step on a particular sample as a separate job, specifying the
resources it needs. The scheduler then handles it from there. But
because you can have (hundreds of) thousands of jobs, with dependencies
among them, you can't just submit everything at once: you have to keep
track of where you are, and make sure you don't submit much more than
the system can handle at any time, so you don't overwhelm the Slurm
queue.
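
To make that a bit more concrete, here is a minimal sketch of what such
a job-defining snakemake rule might look like (rule name, file names,
resource numbers and the run_alignment command are made up for
illustration):

# Snakefile sketch: each application of this rule to a sample becomes
# one Slurm job, with the declared threads and resources passed along
# at submission.
rule align_sample:
    input:
        "reads/{sample}.fastq.gz"
    output:
        "aligned/{sample}.bam"
    threads: 8
    resources:
        mem_mb=16000,   # memory in MB
        runtime=120     # walltime in minutes
    shell:
        "run_alignment --threads {threads} {input} > {output}"

Invoking snakemake with something like `snakemake --slurm --jobs 100`
then caps how many of these jobs it keeps submitted to the Slurm queue
at any one time.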



On Thu, 2023-02-23 at 07:55 -0500, Sean Maxwell wrote:
> Hi David,
> 
> scontrol - interacts with slurmctld using RPC, so it is faster, but
> requests put load on the scheduler itself.
> sacct - interacts with slurmdbd, so it doesn't place additional load
> on the scheduler.
> 
> There is a balance to reach, but the scontrol approach is riskier and
> can start to interfere with the cluster operation if used incorrectly.
> 
> Best,
> 
> -Sean
> 
> On Thu, Feb 23, 2023 at 5:59 AM David Laehnemann
> <david.laehnemann at hhu.de> wrote:
> 
> > Dear Slurm users and developers,
> > 
> > TL;DR:
> > Do any of you know if `scontrol` status checks of jobs are always
> > expected to be quicker than `sacct` job status checks? Do you have
> > any comparative timings between the two commands?
> > And consequently, would using `scontrol` thus be the better default
> > option (as opposed to `sacct`) for repeated job status checks by a
> > workflow management system?
> > 
> > 
> > And here's the long version with background infos and linkouts:
> > 
> > I have recently started using a Slurm cluster and am a regular user
> > of the workflow management system snakemake
> > (https://snakemake.readthedocs.io/en/latest/). This workflow manager
> > recently integrated support for running analysis workflows pretty
> > seamlessly on Slurm clusters. It takes care of managing all job
> > dependencies and handles the submission of jobs according to your
> > global (and job-specific) resource configurations.
> > 
> > One little hiccup when starting to use the snakemake-Slurm
> > combination was a snakemake-internal rate limitation for checking
> > job statuses. You can find the full story here:
> > https://github.com/snakemake/snakemake/pull/2136
> > 
> > For debugging this, I obtained timings on `sacct` and `scontrol`,
> > with `scontrol` consistently about 2.5x quicker in returning the job
> > status when compared to `sacct`. Timings are recorded here:
> > 
> > https://github.com/snakemake/snakemake/blob/b91651d5ea2314b954a3b4b096d7f327ce743b94/snakemake/scheduler.py#L199-L210
> > 
> > However, `sacct` is currently used by default for regularly checking
> > the status of submitted jobs, and `scontrol` is only a fallback
> > whenever `sacct` doesn't find the job (for example because it is not
> > yet running). Now, I was wondering if switching the default to
> > `scontrol` would make sense. Thus, I would like to ask:
> > 
> > 1) Slurm users: do you also have similar timings on different Slurm
> > clusters, and do those confirm that `scontrol` is consistently
> > quicker?
> > 
> > 2) Slurm developers: is `scontrol` expected to be quicker from its
> > implementation, and would using `scontrol` also be the option that
> > puts less strain on the scheduler in general?
> > 
> > Many thanks and best regards,
> > David
> > 
> > 
> > 



