[slurm-users] Elastic Compute
Brian Haymore
brian.haymore at utah.edu
Mon Sep 10 08:52:27 MDT 2018
I believe the default value of OverSubscribe would prevent jobs from sharing a node. You may want to look at that setting and change it from the default.
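If you do want to change it, the partition definition in slurm.conf might look something like the following (the partition name "cloud" is just a placeholder for whatever partition you actually use):

PartitionName=cloud Nodes=compute-[1-100] OverSubscribe=YES Default=YES State=UP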
--
Brian D. Haymore
University of Utah
Center for High Performance Computing
155 South 1452 East RM 405
Salt Lake City, Ut 84112
Phone: 801-558-1150, Fax: 801-585-5366
http://bit.ly/1HO1N2C
On Sep 10, 2018 6:30 AM, Felix Wolfheimer <f.wolfheimer at googlemail.com> wrote:
No, this happens without the "OverSubscribe" parameter being set. I'm using custom resources, though:
GresTypes=some_resource
NodeName=compute-[1-100] CPUs=10 Gres=some_resource:10 State=CLOUD
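(For completeness, and only as an assumption about the node-side setup: a count-only GRES like this would typically also be declared in gres.conf on each compute node, e.g.

Name=some_resource Count=10

but that should not matter for the scale-up behavior discussed here.)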
Submission uses:
sbatch --nodes=1 --ntasks-per-node=1 --gres=some_resource:1
But I just tried it without requesting this custom resource, and it shows the same behavior, i.e., SLURM spins up N nodes when I submit N jobs to the queue, regardless of what each job actually requests.
On Mon, Sep 10, 2018 at 03:55, Brian Haymore <brian.haymore at utah.edu> wrote:
What do you have the OverSubscribe parameter set to on the partition you're using?
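You can check the current value with something like the following (substitute your partition name):

scontrol show partition <partition_name> | grep -i OverSubscribe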
--
Brian D. Haymore
University of Utah
Center for High Performance Computing
155 South 1452 East RM 405
Salt Lake City, Ut 84112
Phone: 801-558-1150, Fax: 801-585-5366
http://bit.ly/1HO1N2C
________________________________________
From: slurm-users [slurm-users-bounces at lists.schedmd.com] on behalf of Felix Wolfheimer [f.wolfheimer at googlemail.com]
Sent: Sunday, September 09, 2018 1:35 PM
To: slurm-users at lists.schedmd.com
Subject: [slurm-users] Elastic Compute
I'm using the SLURM Elastic Compute feature and it works great in
general. However, I noticed some inefficiency in how SLURM decides how
many nodes to create. Let's say I have the following configuration:
NodeName=compute-[1-100] CPUs=10 State=CLOUD
and none of these nodes is currently up and running. Let's further say
that I submit 10 identical jobs at the same time using
sbatch --nodes=1 --ntasks-per-node=1
I expected SLURM to figure out that 10 CPUs are needed in total to
serve all of these jobs and therefore to create a single compute node.
Instead, SLURM triggers the creation of one node per job, i.e., 10
nodes are created. Once the first of these ten nodes is ready to
accept jobs, however, SLURM assigns all 10 submitted jobs to that
single node, and the other nine nodes sit idle until they are
terminated again after a while.
I'm using "SelectType=select/cons_res" to schedule on the CPU level. Is
there some knob which influences this behavior or is this behavior
hard-coded?
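For completeness, a minimal sketch of the kind of configuration I'm describing would be roughly as follows (the power-management script paths, the SuspendTime value, and SelectTypeParameters=CR_CPU are placeholders/assumptions, not my exact settings):

SelectType=select/cons_res
SelectTypeParameters=CR_CPU
ResumeProgram=/path/to/resume.sh
SuspendProgram=/path/to/suspend.sh
SuspendTime=300
NodeName=compute-[1-100] CPUs=10 State=CLOUD
PartitionName=cloud Nodes=compute-[1-100] Default=YES State=UP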