[slurm-users] Configuration recommendations for heterogeneous cluster
Prentice Bisbal
pbisbal at pppl.gov
Tue Jan 22 20:49:52 UTC 2019
I left out a *very* critical detail: One of the reasons I'm looking at
revamping my Slurm configuration is that my users have requested the
capability to submit long-running, low-priority interruptible jobs that
can be killed and requeued when shorter-running, higher-priority jobs
need to use the resources.
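
To make that concrete, the kind of setup I have in mind is Slurm's
QOS-based preemption, along these lines (the QOS names below are
placeholders, and this is only a sketch, not something I have deployed):

    # slurm.conf
    PreemptType=preempt/qos
    PreemptMode=REQUEUE

    # sacctmgr: a low-priority, requeueable QOS that the normal QOS can preempt
    sacctmgr add qos interruptible
    sacctmgr modify qos interruptible set Priority=10 PreemptMode=requeue
    sacctmgr add qos general
    sacctmgr modify qos general set Priority=100 Preempt=interruptible

The long-running jobs would then be submitted with something like
"sbatch --qos=interruptible --requeue ...", so they can be killed and
put back in the queue when a job in the higher-priority QOS needs the
nodes.
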
Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov
On 1/22/19 3:38 PM, Prentice Bisbal wrote:
> Slurm Users,
>
> I would like your input on the best way to configure Slurm for a
> heterogeneous cluster I am responsible for. This e-mail will probably
> be a bit long in order to include all the necessary details of my
> environment, so thanks in advance to those of you who read all of it!
>
> The cluster I support is very heterogeneous, with several different
> network technologies and generations of processors. Although some
> people here refer to this cluster as numerous different clusters, in
> reality it is one cluster, since all the nodes have their
> work assigned to them from a single Slurm Controller, all the nodes
> use the same executables installed on a shared drive, and all nodes
> are diskless and use the same NFSroot OS image, so they are all
> configured 100% alike.
>
> The cluster has been built piecemeal over a number of years, which
> explains the variety of hardware/networking in use. In Slurm, each of
> the different "clusters" is a separate partition intended to serve
> different purposes:
>
> Partition "E" - AMD Opteron 6320 processors, 64 GB RAM/node, 1 GbE,
> meant for serial, and low task count parallel jobs that only use a few
> cores and stay within a single node. Limited to 16 tasks or less in QOS
>
> Partition "D" - AMD Opteron 6136, 6274, and 6376 processors, 32 GB or
> 64 GB RAM per node, 10 GbE, meant for general-purpose parallel jobs
> spanning multiple nodes. A minimum task count of 32 prevents smaller
> jobs that should be run on Partition E from running here.
>
> Partition "K" - AMD Opteron 6274 and 6376 processors, 64 GB RAM per
> node, DDR IB network, meant for tightly-coupled parallel jobs.
>
> Partition "G1" - AMD Opteron 6274, 6276, 6376, and Intel Xeon E5-2698
> v3 & E5-2630 v3 processors, RAM ranging from 128 GB - 512 GB per
> node, 1 GbE network, meant for "large memory" jobs - some nodes are in
> different racks attached to different switches, so not really optimal
> for multi-node jobs.
>
> Partition "J" - AMD Opteron 6136 Processors, 280 GB RAM per node, DDR
> IB, was orginally meant for a specific project, I now need to allow
> general access to it.
>
> Partition "G2" - AMD Opteron 6136, 6274, and 6320 processors, 32 GB,
> 96 GB, and 128 GB RAM per node, IB network; access is restricted to
> specific users/projects.
>
> Partition "M" - Intel Xeon E5-2698 v3 and E5-2697A v4 processors, 128
> GB RAM per node, 1 GbE network, reserved for running one specific
> simulation application.
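>
> For reference, limits like the ones on E and D above can be expressed
> through a per-partition QOS, roughly as follows (the node names and
> QOS names here are placeholders, not my real configuration):
>
>     # slurm.conf
>     PartitionName=E Nodes=e[001-032] QOS=part_e State=UP
>     PartitionName=D Nodes=d[001-064] QOS=part_d State=UP
>
>     # sacctmgr: task-count limits attached to each partition's QOS
>     # (MinTRESPerJob availability may depend on the Slurm version)
>     sacctmgr modify qos part_e set MaxTRESPerJob=cpu=16
>     sacctmgr modify qos part_d set MinTRESPerJob=cpu=32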
>
> To make all this work so far, I have created a job_submit.lua script
> with numerous checks and conditionals that has become quite unwieldy.
> As a result, changes that should be simple take a considerable amount
> of time for me to rewrite and test the script. On top of that, almost
> all of the logic in that script is logic that Slurm can already
> perform in a more easily manageable way. I've essentially re-invented
> wheels that Slurm already provides.
>
> Further, each partition has its own QOS, so my job_submit.lua assigns
> each job to a specific partition and QOS depending on its resource
> requirements. This means that a job assigned to D, even though it
> could also run on K if K is idle, will never be able to run on K. This
> means cluster nodes could go unutilized, reducing cluster utilization
> (which management looks at) and increasing job queue times (which
> users are obsessed with).
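>
> (As an illustration of what I'd rather have: a submission along the
> lines of
>
>     sbatch --partition=D,K -n 64 myjob.sh
>
> where myjob.sh is just a placeholder script, so the job could start on
> whichever of the two partitions has idle nodes first.)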
>
> I would like to simplify this configuration as much as possible to
> reduce the labor it takes me to maintain my job_submit.lua script, and
> therefore make me more responsive to meeting my users' needs, and
> increase cluster utilization. Since I have numerous different
> networks, I was thinking that I could use the topology.conf file to
> keep jobs on a single network and prevent multi-node jobs from running
> on partition E. The partitions reserved for specific
> projects/departments would still need to be requested explicitly.
>
> At first, I was going to take this approach:
>
> 1. Create a single partition with all the general access nodes
>
> 2. Create a topology.conf file to make sure jobs stay within a single
> network (a sketch of what I mean follows this list).
>
> 3. Assign weights to the different partitions so that Slurm will try
> to assign jobs to them in a specific order of preference.
>
> 4. Assign weights to the different nodes, so that the nodes with the
> fastest processors are preferred.
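>
> A minimal topology.conf along the lines of step 2 might look like the
> following (the switch and node names are made up for illustration,
> with one leaf switch per physical network):
>
>     # slurm.conf
>     TopologyPlugin=topology/tree
>
>     # topology.conf
>     SwitchName=eth1g   Nodes=e[001-032]
>     SwitchName=eth10g  Nodes=d[001-064]
>     SwitchName=ib_ddr  Nodes=k[001-048]
>     SwitchName=root    Switches=eth1g,eth10g,ib_ddr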
>
> After getting responses to my questions about the topology.conf file,
> it seems like this approach may not be viable, or at least not the
> best procedure.
>
> I am now considering this:
>
> 0. Restrict access to the non-general access partitions (this is
> already done for the most part, hence step 0).
>
> 1. Assign each partition its own QOS in the slurm.conf file.
>
> 2. Assign a weight to the partitions so Slurm attempts to assign jobs
> to them in a specific order.
>
> 3. Assign weights to the nodes so the nodes are assigned in a specific
> order (faster processors first).
>
> 4. Set the job_submit plugin to all_partitions or partition (a rough
> slurm.conf sketch of steps 1-4 follows this list).
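>
> Roughly, steps 1-4 would translate into slurm.conf entries like these
> (all node lists, weights, QOS names, and priority values below are
> placeholders, and other node/partition attributes are omitted):
>
>     # Step 4: submit jobs to every partition when none is specified
>     JobSubmitPlugins=all_partitions
>
>     # Step 3: lower Weight = preferred, so the fastest nodes get the
>     # lowest weights
>     NodeName=m[001-016] Weight=10
>     NodeName=d[001-064] Weight=20
>     NodeName=e[001-032] Weight=30
>
>     # Steps 1 and 2: a QOS and a priority per partition, so Slurm
>     # considers the partitions in a preferred order
>     PartitionName=D Nodes=d[001-064] QOS=part_d PriorityTier=100
>     PartitionName=E Nodes=e[001-032] QOS=part_e PriorityTier=50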
>
>
> Step 4 in this case is the area I'm the least familiar with. One of
> the reasons we are using a job_submit.lua script is because users will
> often request partitions that are inappropriate for their jobs' needs
> (like trying to run a job that spans multiple nodes on a partition
> with only 1 GbE, or requesting partition G because it's free when
> their job only uses 1 MB of RAM). I'm also not sure I want to give up
> using job_submit.lua 100% by switching job_submit_plugin to "partition".
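>
> (My understanding, which I haven't verified, is that JobSubmitPlugins
> takes a comma-delimited list, so a line like
>
>     JobSubmitPlugins=lua,all_partitions
>
> might let me keep a slimmed-down job_submit.lua alongside one of the
> built-in plugins.)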
>
> My ultimate goal is to have users specify what resources they need
> without specifying a QOS or partition, and let Slurm handle that
> automatically based on the weights I assign to the nodes and
> partitions. I also don't want to lock a job to a specific partition
> at submit time, so that Slurm can allocate it to idle nodes in a
> different partition if that partition has idle nodes when the job is
> finally eligible to run.
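>
> In other words, the ideal submission would carry only resource
> requirements, something like
>
>     sbatch -n 64 --mem-per-cpu=2G --time=12:00:00 run_case.sh
>
> (run_case.sh being an arbitrary example script), with no --partition
> or --qos options, and Slurm would pick the partition and QOS based on
> the weights.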
>
> What is the best way to achieve my goals? All suggestions will be
> considered.
>
> For those of you who made it this far, thanks!
>
> Prentice
>
>
>