[slurm-users] Elastic Compute

Eli V eliventer at gmail.com
Mon Sep 10 06:47:59 MDT 2018


I think you probably want CR_LLN set in your SelectTypeParameters in
slurm.conf. This makes it fill up a node before moving on to the next
instead of "striping" the jobs across the nodes.
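For reference, a minimal slurm.conf sketch of the combination I have in
mind (the CR_CPU part is an assumption on my side; keep whatever
CR_CPU/CR_Core setting you already run with):

SelectType=select/cons_res
SelectTypeParameters=CR_CPU,CR_LLN
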
On Mon, Sep 10, 2018 at 8:29 AM Felix Wolfheimer
<f.wolfheimer at googlemail.com> wrote:
>
> No, this happens without the "OverSubscribe" parameter being set. I'm using custom resources, though:
>
> GresTypes=some_resource
>
> NodeName=compute-[1-100] CPUs=10 Gres=some_resource:10 State=CLOUD
>
> Submission uses:
>
> sbatch --nodes=1 --ntasks-per-node=1 --gres=some_resource:1
>
> But I just tried it without requesting this custom resource. It shows the same behavior, i.e., SLURM spins up N nodes when I submit N jobs to the queue, regardless of what the resource request of each job is.
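>
> For what it's worth, this is how I watch it happen while the jobs are pending (just a quick sketch; adjust the format strings to taste):
>
> sinfo -N -o "%N %T"      # node names and states while the cloud nodes power up
> squeue -o "%i %D %C"     # job id, node count, and CPUs requested per job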
>
>
>
>
> On Mon, Sep 10, 2018 at 03:55, Brian Haymore <brian.haymore at utah.edu> wrote:
>>
>> What do you have the OverSubscribe parameter set to on the partition you're using?
>>
>>
>> --
>> Brian D. Haymore
>> University of Utah
>> Center for High Performance Computing
>> 155 South 1452 East RM 405
>> Salt Lake City, Ut 84112
>> Phone: 801-558-1150, Fax: 801-585-5366
>> http://bit.ly/1HO1N2C
>>
>> ________________________________________
>> From: slurm-users [slurm-users-bounces at lists.schedmd.com] on behalf of Felix Wolfheimer [f.wolfheimer at googlemail.com]
>> Sent: Sunday, September 09, 2018 1:35 PM
>> To: slurm-users at lists.schedmd.com
>> Subject: [slurm-users] Elastic Compute
>>
>> I'm using the SLURM Elastic Compute feature and it works great in
>> general. However, I noticed that there's a bit of inefficiency in the
>> decision about how many nodes SLURM creates. Let's say I have the
>> following configuration:
>>
>> NodeName=compute-[1-100] CPUs=10 State=CLOUD
>>
>> and none of these nodes are up and running. Let's further say
>> that I create 10 identical jobs and submit them at the same time using
>>
>> sbatch --nodes=1 --ntasks-per-node=1
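>>
>> In practice I submit them with a small loop along these lines
>> (job.sh is just a placeholder batch script):
>>
>> for i in $(seq 1 10); do
>>     sbatch --nodes=1 --ntasks-per-node=1 job.sh
>> done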
>>
>> I expected SLURM to work out that 10 CPUs are needed in total to
>> serve all of the jobs and therefore to create a single compute node.
>> However, SLURM triggers the creation of one node per job, i.e., 10
>> nodes are created. When the first of these ten nodes is ready to
>> accept jobs, SLURM then assigns all 10 submitted jobs to that single
>> node. The other nine nodes sit idle and are terminated again after a
>> while.
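>>
>> For reference, the elastic/power-saving knobs involved here are set
>> up roughly like this (the script paths and timeouts below are
>> placeholders rather than my real values):
>>
>> SuspendProgram=/opt/slurm/bin/suspend_node.sh
>> ResumeProgram=/opt/slurm/bin/resume_node.sh
>> SuspendTime=300
>> ResumeTimeout=600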
>>
>> I'm using "SelectType=select/cons_res" to schedule on the CPU level. Is
>> there some knob that influences this behavior, or is it hard-coded?
>>
>>


