[slurm-users] snakemake and slurm in general

Loris Bennett loris.bennett at fu-berlin.de
Fri Feb 24 07:17:47 UTC 2023


Hi David,

(Thanks for changing the subject to something more appropriate).

David Laehnemann <david.laehnemann at hhu.de> writes:

> Yes, but only to an extent. The linked conversation ends with this:
>
>>> Do you have any best practice about setting MaxJobCount to a proper
> number?
>
>> That depends upon your workload. You could probably set MaxJobCount
> to at least 50000 with most systems (assuming you have at least a few
> gigabytes of memory). Some sites run with a value of 1000000 or more.
>
> So, it is configurable. But this has a limit. And if you have lots of
> users on a system submitting lots of jobs, even a value of 1000000 can
> get exhausted.

Yes, but you can start a lot more jobs and stay within the limit if you
use job arrays.  When you submit individual jobs, a job ID for each one
needs to be written to the Slurm job database.  This can cause the
database to become unresponsive if the number submitted at one time,
whether by snakemake or just a bash script looping over 'sbatch', is too
high.  If, on the other hand, you submit a job array, only one entry
needs to be made in the database immediately, with entries for the
elements of the array only being made when a job can actually start.
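
As a rough illustration of the difference, here is a minimal Python
sketch that builds a single 'sbatch --array' command line in place of a
loop over 'sbatch'.  The helper name array_sbatch_command and the script
name process_sample.sh are made up for the example; only the '--array'
option and its '%' throttle are actual sbatch syntax.

```python
import shlex


def array_sbatch_command(script, n_tasks, max_running=None):
    """Build one sbatch command that submits n_tasks as a single job array.

    Hypothetical helper for illustration: it only assembles the command
    line and does not call sbatch itself.
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    # Array indices run from 0 to n_tasks - 1.
    spec = f"0-{n_tasks - 1}"
    if max_running is not None:
        # '%N' limits how many array tasks may run at the same time.
        spec += f"%{max_running}"
    return ["sbatch", f"--array={spec}", script]


if __name__ == "__main__":
    # One submission for 1000 tasks, at most 50 of them running at once.
    cmd = array_sbatch_command("process_sample.sh", 1000, max_running=50)
    print(shlex.join(cmd))
```

Inside the batch script, each task can pick its own input via the
SLURM_ARRAY_TASK_ID environment variable, so the per-sample logic stays
the same as with individual jobs.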

A large number of individual jobs with the same resource requirements
also prevents backfill from working properly.  The mechanism only
considers a certain (configurable) number of pending jobs to see whether
they qualify for backfilling.  In this context, a job array is counted
as a single job, regardless of how large the array actually is.  A flood
of individual jobs will therefore degrade the throughput of the system
and thus negatively impact all users.  For this reason, on our system we
would not allow users to employ a mechanism which generates a large
number of jobs but does not employ job arrays.
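
The effect on other users can be sketched with a toy model.  This is not
Slurm's scheduler, just the counting argument from the paragraph above:
the queue is a list of (owner, number_of_tasks) entries, and 'depth' is
a stand-in for the configurable number of pending jobs that backfill
looks at (in Slurm this is, I believe, set via bf_max_job_test in
SchedulerParameters).  The value 500 and the user names are assumptions
chosen for the example.

```python
def considered_for_backfill(queue, depth):
    """Return the pending entries a depth-limited backfill pass would see.

    Toy model: each queue entry is (owner, n_tasks). A job array is one
    entry no matter how many tasks it contains.
    """
    return queue[:depth]


depth = 500  # assumed limit, for illustration only

# User A submits 10000 individual jobs, then user B submits one small job.
individual = [("A", 1)] * 10_000 + [("B", 1)]

# Same work from user A as a single array, followed by user B's job.
as_array = [("A", 10_000), ("B", 1)]

# With individual jobs, B's job sits beyond the considered window.
b_seen_individual = ("B", 1) in considered_for_backfill(individual, depth)

# With an array, B's job is the second entry and gets considered.
b_seen_array = ("B", 1) in considered_for_backfill(as_array, depth)

print(b_seen_individual, b_seen_array)
```

In this model user B's small job, which might fit into a gap, is never
even examined while user A's individual jobs fill the window.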

> And in either case, this is not something that speaks against a
> workflow management system giving you additional control over things.
> So I'm not sure what exactly we are arguing about, right here...

I just wanted to point out that, whereas for some users approaches
such as snakemake obviously scratch a very important itch, for people
running HPC systems, and indeed for users who don't use such mechanisms,
they may cause issues.

Cheers,

Loris

> cheers,
> david
>
>
>
> On Thu, 2023-02-23 at 17:41 +0100, Ole Holm Nielsen wrote:
>> On 2/23/23 17:07, David Laehnemann wrote:
>> > In addition, there are very clear limits to how many jobs slurm can
>> > handle in its queue, see for example this discussion:
>> > https://bugs.schedmd.com/show_bug.cgi?id=2366
>> 
>> My 2 cents: Slurm's job limits are configurable, see this Wiki page:
>> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#maxjobcount-limit
>> 
>> /Ole
>> 
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
