I think all of the replies point to --exclusive being your best solution (only solution?).

You need to know exactly the maximum number of cores each application will use; only then can you safely let other applications use the remaining cores. Otherwise, at some point while the applications are running, two of them will land on the same core and you can have problems (contention, cache thrashing, unpredictable run times). I don't know of any way to let one application use more cores than it was allocated without the possibility of multiple applications sharing the same cores.
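For example, here is a minimal sketch of requesting an exact core count in a batch script (the script name and the figure of 8 cores are just illustrative, and this assumes task/affinity or task/cgroup is configured so Slurm actually confines jobs to their cores):

    #!/bin/bash
    #SBATCH --job-name=solver
    #SBATCH --ntasks=1
    # Request exactly the maximum number of cores the application will use.
    #SBATCH --cpus-per-task=8

    # With task/affinity or task/cgroup configured, Slurm confines this job
    # to its 8 allocated cores, leaving the rest of the node to other jobs.
    srun ./solver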

Fundamentally, you should not have one application using a variable number of cores while a second application runs on those same cores. (IMHO)

As everyone has said, your best bet is to use --exclusive and give an application access to all of the cores, even if it doesn't use all of them all the time.
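A minimal sketch of what that looks like in a batch script (the application name is just a placeholder):

    #!/bin/bash
    #SBATCH --job-name=myapp
    #SBATCH --nodes=1
    # Ask for the whole node; no other jobs will be scheduled on it.
    #SBATCH --exclusive

    # The job owns every core on the node, so the application can vary
    # its thread count freely without colliding with anyone else's work.
    srun ./myapp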

Good luck.

Jeff

P.S. Someone mentioned watching memory usage on the node. That too is important if you do not use --exclusive. Otherwise Mr. OOM will come to visit (the kernel's Out Of Memory killer, which starts killing processes). In my experience, the OOM killer kills HPC processes first, because they use the most memory and the most CPU time.
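A minimal sketch of bounding a job's memory on a shared node (the 10G figure is only illustrative, and this assumes your cgroup.conf sets ConstrainRAMSpace=yes so the limit is actually enforced):

    #!/bin/bash
    #SBATCH --cpus-per-task=8
    # Cap the job's memory so it cannot starve other jobs on the node.
    #SBATCH --mem=10G

    # With ConstrainRAMSpace=yes in cgroup.conf, this job is contained
    # before it can exhaust node memory and summon the OOM killer
    # against everyone else's processes.
    srun ./myapp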

On Thu, Aug 1, 2024 at 4:06 PM Henrique Almeida via slurm-users <slurm-users@lists.schedmd.com> wrote:
 Laura, yes, as long as there's around 10 GB of RAM available, and
ideally at least 5 harts too, but I expect 50 most of the time, not 5.

On Thu, Aug 1, 2024 at 4:28 PM Laura Hild <lsh@jlab.org> wrote:
>
> So you're wanting that, instead of waiting for the task to finish and then running on the whole node, that the job should run immediately on n-1 CPUs?  If there were only one CPU available in the entire cluster, would you want the job to start running immediately on one CPU instead of waiting for more?
>


--
 Henrique Dante de Almeida
 hdante@gmail.com
