Agree with that. Plus, of course, even if the jobs run a bit slower by not having all their cores on a single node, they will be scheduled sooner, so the overall turnaround time for the user will be better, and ultimately that's what they care about. I've always been of the view, for any scheduler, that the less you try to constrain it, the better. It really depends on what you're trying to optimise for, but generally speaking I optimise for maximum utilisation and throughput, unless there is a specific business case that needs particular workloads prioritised, in which case I'll compromise on throughput to get the urgent work through sooner.
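On the submission side that mostly comes down to users not over-specifying the shape of the job. Roughly along these lines (the script name and core counts are just placeholders, and the right flags obviously depend on the code being run):

    # Constrained: insist on one whole 128-core node, and possibly wait
    # a long time for one to drain
    sbatch --nodes=1 --ntasks=128 --exclusive mpi_job.sh

    # Less constrained: 128 tasks wherever cores happen to be free right now
    sbatch --ntasks=128 mpi_job.sh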
Tun

________________________________
From: Loris Bennett via slurm-users <slurm-users@lists.schedmd.com>
Sent: 09 April 2024 06:51
To: slurm-users@lists.schedmd.com
Cc: Gerhard Strangar <g.s@arcor.de>
Subject: [slurm-users] Re: Avoiding fragmentation
Hi Gerhard,
Gerhard Strangar via slurm-users slurm-users@lists.schedmd.com writes:
Hi,
I'm trying to figure out how to deal with a mix of few- and many-CPU jobs. By that I mean most jobs use 128 CPUs, but sometimes there are jobs with only 16. As soon as a job with only 16 is running, the scheduler splits the next 128-CPU jobs into 96+16 each, instead of assigning a full 128-CPU node to them. Is there a way for the administrator to make the scheduler prefer full nodes? The existence of pack_serial_at_end makes me believe there is not, because that is basically what I need, apart from my serial jobs using 16 CPUs instead of 1.
Gerhard
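For reference, the scheduler knobs that usually come up in this kind of discussion look roughly like the sketch below; whether any of them really amounts to "prefer whole nodes for the big jobs", which is what Gerhard is after, needs checking against the slurm.conf man page for the version in use:

    # slurm.conf (sketch only -- verify names and behaviour for your version)
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory,CR_Pack_Nodes
    SchedulerParameters=pack_serial_at_end,bf_busy_nodes
    # CR_Pack_Nodes      - pack a job's tasks onto as few of its allocated nodes as possible
    # pack_serial_at_end - place serial jobs at the end of the node list rather than best-fit
    # bf_busy_nodes      - when planning future starts, backfill prefers nodes already in use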
This may well not be relevant for your case, but we actively discourage the use of full nodes for the following reasons:
- When the cluster is full, which is most of the time, MPI jobs in general will start much faster if they don't specify the number of nodes and certainly don't request full nodes. The overhead due to the jobs being scattered across nodes is often much lower than the additional waiting time incurred by requesting whole nodes.
- When all the cores of a node are requested, all the memory of the node becomes unavailable to other jobs, regardless of how much memory is requested or indeed how much is actually used. This holds up jobs with low CPU but high memory requirements and thus reduces the total throughput of the system (see the sketch below).
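Concretely, the difference on the submission side looks something like this (the 2G per core is only an example figure):

    # Whole-node request: once all 128 cores are taken, whatever memory the
    # job does not use is still unusable by anyone else
    #SBATCH --nodes=1
    #SBATCH --ntasks=128
    #SBATCH --exclusive

    # Core-based request: cores and memory left over on the nodes actually
    # used remain available for other jobs
    #SBATCH --ntasks=128
    #SBATCH --mem-per-cpu=2G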
These factors are important for us because we have a large number of single-core jobs and almost all our users, whether doing MPI or not, significantly overestimate the memory requirements of their jobs.
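A quick way to see that gap is to compare what was requested with what was actually used once a job has finished, e.g. with sacct (or seff, if the Slurm contribs are installed); the job ID below is of course just a placeholder:

    # Requested vs. peak memory for a finished job
    sacct -j 1234567 --format=JobID,ReqMem,MaxRSS,NCPUS,Elapsed,State

    # Or, if seff is available:
    seff 1234567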
Cheers,
Loris
--
Dr. Loris Bennett (Herr/Mr)
FUB-IT (ex-ZEDAT), Freie Universität Berlin