[slurm-users] Elastic Compute

Felix Wolfheimer f.wolfheimer at googlemail.com
Mon Sep 10 06:26:27 MDT 2018


No, this happens without the "OverSubscribe" parameter being set. I'm using
custom resources, though:

GresTypes=some_resource

NodeName=compute-[1-100] CPUs=10 Gres=some_resource:10 State=CLOUD
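
For completeness, the rest of my elastic-compute related settings look
roughly like this (a sketch, not my exact values: the program paths and
timing values below are placeholders, and SelectTypeParameters is shown as
CR_CPU since I schedule on the CPU level):

# CPU-level scheduling with consumable resources
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
# power-save / cloud node hooks (paths and timings are placeholders)
ResumeProgram=/usr/local/sbin/slurm_resume.sh
SuspendProgram=/usr/local/sbin/slurm_suspend.sh
SuspendTime=600
ResumeTimeout=300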

Submission uses:

sbatch --nodes=1 --ntasks-per-node=1 --gres=some_resource:1

But I just tried it without requesting this custom resource. It shows the
same behavior, i.e., SLURM spins up N nodes when I submit N jobs to the
queue, regardless of what the resource request of each job is.
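
In case it helps to reproduce, my test looks roughly like this (the sleep
workload is just a placeholder for the real jobs):

# submit 10 identical single-CPU jobs at the same time
for i in $(seq 1 10); do
    sbatch --nodes=1 --ntasks-per-node=1 --wrap="sleep 600"
done

# then watch which CLOUD nodes get resumed and where the jobs land
squeue -o "%i %T %N"
sinfo -N -o "%N %T"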




Am Mo., 10. Sep. 2018 um 03:55 Uhr schrieb Brian Haymore <
brian.haymore at utah.edu>:

> What do you have the OverSubscribe parameter set to on the partition
> you're using?
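>
> For example (assuming your partition is named "cloud"), you can check the
> current value with:
>
> scontrol show partition cloud | grep OverSubscribe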
>
>
> --
> Brian D. Haymore
> University of Utah
> Center for High Performance Computing
> 155 South 1452 East RM 405
> Salt Lake City, Ut 84112
> Phone: 801-558-1150, Fax: 801-585-5366
> http://bit.ly/1HO1N2C
>
> ________________________________________
> From: slurm-users [slurm-users-bounces at lists.schedmd.com] on behalf of
> Felix Wolfheimer [f.wolfheimer at googlemail.com]
> Sent: Sunday, September 09, 2018 1:35 PM
> To: slurm-users at lists.schedmd.com
> Subject: [slurm-users] Elastic Compute
>
> I'm using the SLURM Elastic Compute feature and it works great in
> general. However, I noticed that there's a bit of inefficiency in the
> decision about the number of nodes that SLURM creates. Let's say I have
> the following configuration:
>
> NodeName=compute-[1-100] CPUs=10 State=CLOUD
>
> and none of these nodes are up and running. Let's further say
> that I create 10 identical jobs and submit them at the same time using
>
> sbatch --nodes=1 --ntasks-per-node=1
>
> I expected SLURM to figure out that 10 CPUs are required in total to
> serve all the jobs and, thus, to create a single compute node. However,
> SLURM triggers the creation of one node per job, i.e., 10 nodes are
> created. When the first of these ten nodes is ready to accept jobs,
> though, SLURM assigns all 10 submitted jobs to this single node. The
> other nine nodes that were created sit idle and are terminated again
> after a while.
>
> I'm using "SelectType=select/cons_res" to schedule on the CPU level. Is
> there some knob that influences this behavior, or is it hard-coded?
>
>
>