[slurm-users] Re: Avoiding fragmentation

9 Apr 2024


Hi Gerhard,

Gerhard Strangar via slurm-users <slurm-users@lists.schedmd.com> writes:
> Hi,
>
> I'm trying to figure out how to deal with a mix of few- and many-cpu
> jobs. By that I mean most jobs use 128 cpus, but sometimes there are
> jobs with only 16. As soon as that job with only 16 is running, the
> scheduler splits the next 128 cpu jobs into 96+16 each, instead of
> assigning a full 128 cpu node to them. Is there a way for the
> administrator to achieve preferring full nodes?
>
> The existence of pack_serial_at_end makes me believe there is not,
> because that basically is what I needed, apart from my serial jobs using
> 16 cpus instead of 1.
>
> Gerhard

This may well not be relevant for your case, but we actively discourage
the use of full nodes for the following reasons:

- When the cluster is full, which is most of the time, MPI jobs in
  general will start much faster if they don't specify the number of
  nodes and certainly don't request full nodes.  The overhead due to
  the jobs being scattered across nodes is often much lower than the
  additional waiting time incurred by requesting whole nodes.

- When all the cores of a node are requested, all the memory of the
  node becomes unavailable to other jobs, regardless of how much
  memory is requested or indeed how much is actually used.  This holds
  up jobs with low CPU but high memory requirements and thus reduces
  the total throughput of the system.
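
To make the second point concrete, here is a toy model in Python.  It is
only a sketch under stated assumptions: the Node class and the
stranded_mem_gb helper are hypothetical and not part of Slurm, the node
size (128 cores, 512 GB) and the job sizes are made-up numbers, and the
model assumes that both cores and memory are tracked per job.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy compute node; the sizes used below are illustrative assumptions."""
    cores: int
    mem_gb: int
    used_cores: int = 0
    used_mem_gb: int = 0

    @property
    def free_cores(self) -> int:
        return self.cores - self.used_cores

    @property
    def free_mem_gb(self) -> int:
        return self.mem_gb - self.used_mem_gb

    def can_run(self, cores: int, mem_gb: int) -> bool:
        # A job fits only if BOTH enough cores and enough memory are free.
        return self.free_cores >= cores and self.free_mem_gb >= mem_gb

    def allocate(self, cores: int, mem_gb: int) -> None:
        if not self.can_run(cores, mem_gb):
            raise ValueError("request does not fit on this node")
        self.used_cores += cores
        self.used_mem_gb += mem_gb


def stranded_mem_gb(node: Node) -> int:
    """Memory that is free but unusable because no cores are left."""
    return node.free_mem_gb if node.free_cores == 0 else 0


# Case 1: an MPI job takes every core on the node but asks for little memory.
full = Node(cores=128, mem_gb=512)
full.allocate(cores=128, mem_gb=64)
print(full.can_run(cores=1, mem_gb=200))  # False: no core left
print(stranded_mem_gb(full))              # 448 GB free but unreachable

# Case 2: the same job leaves a few cores free here (the rest run elsewhere).
shared = Node(cores=128, mem_gb=512)
shared.allocate(cores=120, mem_gb=60)
print(shared.can_run(cores=1, mem_gb=200))  # True: the high-memory job fits
print(stranded_mem_gb(shared))              # 0
```

In the first case the single-core, 200 GB job has to wait although 448 GB
of the node's memory is idle; in the second it can start straight away.
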
These factors are important for us because we have a large number of
single-core jobs and almost all the users, whether doing MPI or not,
significantly overestimate the memory requirements of their jobs.

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
FUB-IT (ex-ZEDAT), Freie Universität Berlin
