[slurm-users] speed / efficiency of sacct vs. scontrol

Loris Bennett loris.bennett at fu-berlin.de
Mon Feb 27 15:12:07 UTC 2023


Hi David,

David Laehnemann <david.laehnemann at hhu.de> writes:

> Dear Ward,
>
> if used correctly (and that is a big caveat for any method of
> interacting with a cluster system), snakemake will only submit as
> many jobs as fit within the resources of the cluster at any one point
> in time (or however many resources you tell snakemake it can use). So
> unless there are thousands of cores available (or you "lie" to
> snakemake, telling it that there are many more cores than actually
> exist), it will only ever submit hundreds of jobs (or a lot fewer, if
> the jobs each require multiple cores). Accordingly, any queries will
> also only be for the number of jobs that snakemake currently has
> submitted. And snakemake will only submit new jobs once it registers
> previously submitted jobs as finished.
>
> So workflow managers can actually help reduce the strain on the
> scheduler, by only ever submitting stuff within the general limits of
> the system (as opposed to, for example, using some bash loop to just
> submit all of your analysis steps or samples at once).
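
For reference, the snakemake-side cap David presumably means is the
limit on concurrently submitted cluster jobs, i.e. an invocation
roughly along the lines of

  snakemake --jobs 100 --cluster "sbatch ..." <targets>

(a sketch only -- the exact options depend on the snakemake version in
use).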

I don't see this as a particular advantage for the scheduler.  If the
maximum number of jobs a user can submit is set to, say, 5000, then it
makes no difference whether those 5000 jobs are generated by snakemake
or by a batch script.  On our system, strain tends mainly to occur when
many similar jobs fail immediately after they have started.
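
(On the Slurm side such a per-user cap would typically be an
association or QOS limit, i.e. something along the lines of

  sacctmgr modify user name=alice set MaxSubmitJobs=5000

with "alice" obviously a made-up user name -- the point being that the
limit applies regardless of what generates the jobs.)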

How does snakemake behave in such a situation?  If the job database is
already clogged up trying to record too many jobs completing within too
short a time, then snakemake querying the database at that moment and
possibly starting more jobs (because others have failed and thus
completed) could exacerbate the problem.
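
As an aside, and since the subject of this thread is the efficiency of
sacct vs. scontrol: whatever polling interval snakemake settles on, one
batched call per interval, e.g. something like

  sacct -X -n -P -o JobID,State -j 1001,1002,1003

(job IDs made up), should be considerably cheaper for slurmdbd than one
sacct or scontrol call per job.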

> And for example,
> snakemake has a mechanism for batching a number of smaller jobs into
> larger jobs for submission to the cluster, so this might be something
> to suggest to those of your users who cause trouble when using
> snakemake (especially the `--group-components` mechanism):
> https://snakemake.readthedocs.io/en/latest/executing/grouping.html

This seems to me, from the perspective of an operator, to be the main
advantage.
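
If I have understood the documentation correctly, that would amount to
an invocation roughly like

  snakemake --groups myrule=grp0 --group-components grp0=10 ...

i.e. ten instances of a rule bundled into a single cluster job (rule
and group names made up here) -- corrections welcome.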

> The query mechanism for job status is a different story. And I'm
> specifically here on this mailing list to get as much input as possible
> to improve this -- and welcome anybody who wants to chime in on my
> respective work-in-progress pull request right here:
> https://github.com/snakemake/snakemake/pull/2136
>
> And if you are seeing a workflow management system causing trouble on
> your system, probably the most sustainable way of getting this
> resolved is to file issues or pull requests with the respective
> project, with suggestions like the ones you made. For snakemake, a
> second good place to chime in at the moment would be the issue
> discussing Slurm job array support:
> https://github.com/snakemake/snakemake/issues/301
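
(For context: job array support would let a workflow engine turn
hundreds of near-identical submissions into a single one, roughly

  sbatch --array=1-500%50 step.sh

where the %50 throttles the array to 50 tasks running at once -- a
sketch with made-up numbers, but that is the kind of reduction in
scheduler load such support would bring.)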

I have to disagree with that suggestion.  I think the onus is on the
people in a given community to ensure that their software behaves well
on the systems they want to use, not on the operators of those systems.
Those of us running HPC systems often have to deal with a very large
range of different pieces of software, and time and personnel are
limited.  If some program used by only a subset of the users is causing
disruption, then it already costs us time and energy to mitigate those
effects.  Even if I had the appropriate skill set, I don't see myself
writing many patches for workflow managers any time soon.

Cheers,

Loris

> And for Nextflow, another commonly used workflow manager in my field
> (bioinformatics), there's also an issue discussing Slurm job array
> support:
> https://github.com/nextflow-io/nextflow/issues/1477
>
> cheers,
> david
>
>
> On Mon, 2023-02-27 at 13:24 +0100, Ward Poelmans wrote:
>> On 24/02/2023 18:34, David Laehnemann wrote:
>> > Those queries then should not have to happen too often, although do
>> > you have any indication of a range for when you say "you still
>> > wouldn't want to query the status too frequently"?  Because I don't
>> > really, and would probably opt for some compromise of every 30
>> > seconds or so.
>> 
>> I think this is exactly why HPC sysadmins are sometimes not very
>> happy about these tools. You're talking about 10000s of jobs on the
>> one hand, yet you want to fetch the status every 30 seconds? What is
>> the point of that other than overloading the scheduler?
>> 
>> We're telling our users not to query Slurm too often and usually
>> give 5 minutes as a good interval. You have to let Slurm do its job.
>> There is no point in querying in a loop every 30 seconds when we're
>> talking about large numbers of jobs.
>> 
>> 
>> Ward
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin


