[slurm-users] snakemake and slurm in general - correction

Loris Bennett loris.bennett at fu-berlin.de
Fri Feb 24 08:22:33 UTC 2023


Loris Bennett <loris.bennett at fu-berlin.de> writes:

> Hi David,
>
> (Thanks for changing the subject to something more appropriate).
>
> David Laehnemann <david.laehnemann at hhu.de> writes:
>
>> Yes, but only to an extent. The linked conversation ends with this:
>>
>>>> Do you have any best practice about setting MaxJobCount to a proper
>> number?
>>
>>> That depends upon your workload. You could probably set MaxJobCount
>> to at least 50000 with most systems (assuming you have at least a few
>> gigabytes of memory). Some sites run with a value of 1000000 or more.
>>
>> So, it is configurable. But this has a limit. And if you have lots of
>> users on a system submitting lots of jobs, even a value of 1000000 can
>> get exhausted.
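
  For reference, the limit discussed above is the MaxJobCount parameter
  in slurm.conf.  A minimal sketch (the value is illustrative, not a
  recommendation):

      # slurm.conf
      # Maximum number of jobs Slurm keeps in memory at any one time;
      # submissions beyond this limit are rejected until old job
      # records are purged.
      MaxJobCount=50000
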
>
> Yes, but start a lot more jobs and stay within the limit if you use job

  .. but you can start ...

> arrays.  When you submit individual jobs, a job ID for each one needs to
> be written to the Slurm job database.  This can cause the database to
> become unresponsive if the number submitted at one time, whether by
> snakemake or just a bash script looping over 'sbatch', is too high.  If,
> on the other hand, you submit a job array, only one entry needs to be
> made in the database immediately, with entries for the elements of the
> array only being made when a job can actually start.
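
  To make the difference concrete, a sketch (script name and array size
  are made up):

      # Individual jobs: one record is written to the database per
      # submission, all up front
      for i in $(seq 1 1000); do
          sbatch job.sh "$i"
      done

      # Job array: a single record up front; records for the array
      # elements are only created as tasks actually start
      # (each task picks up its index via $SLURM_ARRAY_TASK_ID)
      sbatch --array=1-1000 job.sh
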
>
> This is why a large number of individual jobs with the same resource
> requirements prevents backfill from working properly.  The mechanism
> only considers a certain (configurable) number of pending jobs to see
> whether they qualify for backfilling.  In this context, a job array is
> counted as a single job, regardless of how large the array actually is.
> This will degrade the throughput of the system and thus negatively
> impact all users.  Therefore, on our system we would not allow users to
> employ a mechanism which generates a large number of jobs but does not
> use job arrays.
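
  The cutoff in question is bf_max_job_test in SchedulerParameters
  (a sketch; the value is illustrative):

      # slurm.conf
      # Backfill examines at most this many pending jobs per cycle.
      # Thousands of individual pending jobs push everything else past
      # the cutoff, whereas a pending job array counts as one job here.
      SchedulerParameters=bf_max_job_test=1000
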
>
>> And in either case, this is not something that speaks against a
>> workflow management system giving you additional control over things.
>> So I'm not sure what exactly we are arguing about, right here...
>
> I just wanted to point out that, whereas approaches such as snakemake
> obviously scratch a very important itch for some users, they may cause
> issues for people running HPC systems, and indeed for users who don't
> use such mechanisms.
>
> Cheers,
>
> Loris
>
>> cheers,
>> david
>>
>>
>>
>> On Thu, 2023-02-23 at 17:41 +0100, Ole Holm Nielsen wrote:
>>> On 2/23/23 17:07, David Laehnemann wrote:
>>> > In addition, there are very clear limits to how many jobs slurm can
>>> > handle in its queue, see for example this discussion:
>>> > https://bugs.schedmd.com/show_bug.cgi?id=2366
>>> 
>>> My 2 cents: Slurm's job limits are configurable, see this Wiki page:
>>> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#maxjobcount-limit
>>> 
>>> /Ole
>>> 
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
