[slurm-users] Configuration recommendations for heterogeneous cluster
Prentice Bisbal
pbisbal at pppl.gov
Tue Jan 22 20:49:52 UTC 2019
I left out a *very* critical detail: One of the reasons I'm looking at
revamping my Slurm configuration is that my users have requested the
capability to submit long-running, low-priority interruptible jobs that
can be killed and requeued when shorter-running, higher-priority jobs
need to use the resources.
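
To make that concrete, the kind of setup I have in mind is Slurm's
QOS-based preemption, along these lines (the QOS names below are
placeholders, and this is only a sketch, not something I have deployed):

    # slurm.conf
    PreemptType=preempt/qos
    PreemptMode=REQUEUE

    # sacctmgr: a low-priority, requeueable QOS that the normal QOS can preempt
    sacctmgr add qos interruptible
    sacctmgr modify qos interruptible set Priority=10 PreemptMode=requeue
    sacctmgr add qos general
    sacctmgr modify qos general set Priority=100 Preempt=interruptible

The long-running jobs would then be submitted with something like
"sbatch --qos=interruptible --requeue ...", so they can be killed and
put back in the queue when a job in the higher-priority QOS needs the
nodes.
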
Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov
On 1/22/19 3:38 PM, Prentice Bisbal wrote:
> Slurm Users,
>
> I would like your input on the best way to configure Slurm for a
> heterogeneous cluster I am responsible for. This e-mail will probably
> be a bit long in order to include all the necessary details of my
> environment, so thanks in advance to those of you who read all of it!
>
> The cluster I support is very heterogeneous, with several different
> network technologies and generations of processors. Although some
> people here refer to this cluster as numerous different clusters, in
> reality it is one cluster, since all the nodes have their
> work assigned to them from a single Slurm Controller, all the nodes
> use the same executables installed on a shared drive, and all nodes
> are diskless and use the same NFSroot OS image, so they are all
> configured 100% alike.
>
> The cluster has been built piecemeal over a number of years, which
> explains the variety of hardware/networking in use. In Slurm, each of
> the different "clusters" is a separate partition intended to serve
> different purposes:
>
> Partition "E" - AMD Opteron 6320 processors, 64 GB RAM/node, 1 GbE,
> meant for serial, and low task count parallel jobs that only use a few
> cores and stay within a single node. Limited to 16 tasks or less in QOS
>
> Partition "D" - AMD Opteron 6136, 6274, and 6376 processors, 32 GB or
> 64 GB RAM per node, 10 GbE, meant for general-purpose parallel jobs
> spanning multiple nodes. A minimum task count of 32 prevents smaller
> jobs that should be run on Partition E from running here.
>
> Partition "K" - AMD Opteron 6274 and 6376 processors, 64 GB RAM per
> node, DDR IB network, meant for tightly-coupled parallel jobs.
>
> Partition "G1" - AMD Opteron 6274, 6276, 6376, and Intel Xeon E5-2698
> v3 & E5-2630 v3 processors, RAM ranging from 128 GB - 512 GB per
> node, 1 GbE network, meant for "large memory" jobs - some nodes are in
> different racks attached to different switches, so not really optimal
> for multi-node jobs.
>
> Partition "J" - AMD Opteron 6136 Processors, 280 GB RAM per node, DDR
> IB, was orginally meant for a specific project, I now need to allow
> general access to it.
>
> Partition "G2" - AMD Opteron 6136, 6274, and 6320 processors, 32 GB,
> 96 GB, and 128 GB RAM per node, IB network; access is restricted to
> specific users/projects.
>
> Partition "M" - Intel Xeon E5-2698 v3 and E5-2697A v4 processors, 128
> GB RAM per node, 1 GbE network, reserved for running one specific
> simulation application.
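>
> For reference, limits like the ones on E and D above can be expressed
> through a per-partition QOS, roughly as follows (the node names and
> QOS names here are placeholders, not my real configuration):
>
>     # slurm.conf
>     PartitionName=E Nodes=e[001-032] QOS=part_e State=UP
>     PartitionName=D Nodes=d[001-064] QOS=part_d State=UP
>
>     # sacctmgr: task-count limits attached to each partition's QOS
>     # (MinTRESPerJob availability may depend on the Slurm version)
>     sacctmgr modify qos part_e set MaxTRESPerJob=cpu=16
>     sacctmgr modify qos part_d set MinTRESPerJob=cpu=32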
>
> To make all this work so far, I have created a job_submit.lua script
> with numerous checks and conditionals that has become quite unwieldy.
> As a result, changes that should be simple take a considerable amount
> of time for me to rewrite and test the script. On top of that, almost
> all of the logic in that script is logic that Slurm can already
> perform in a more easily manageable way. I've essentially re-invented
> wheels that Slurm already provides.
>
> Further, each partition has its own QOS, so my job_submit.lua assigns
> each job to a specific partition and QOS depending on its resource
> requirements. This means that a job assigned to D, even though it
> could also run on K if K is idle, will never be able to run on K. This
> means cluster nodes could go unutilized, reducing cluster utilization
> (which management looks at) and increasing job queue times (which
> users are obsessed with).
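>
> (As an illustration of what I'd rather have: a submission along the
> lines of
>
>     sbatch --partition=D,K -n 64 myjob.sh
>
> where myjob.sh is just a placeholder script, so the job could start on
> whichever of the two partitions has idle nodes first.)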
>
> I would like to simplify this configuration as much as possible to
> reduce the labor it takes me to maintain my job_submit.lua script, and
> therefore make me more responsive to meeting my users' needs, and
> increase cluster utilization. Since I have numerous different
> networks, I was thinking that I could use the topology.conf file to
> keep jobs on a single network and prevent multi-node jobs from running
> on partition E. The partitions reserved for specific
> projects/departments would still need to be requested explicitly.
>
> At first, I was going to take this approach:
>
> 1. Create a single partition with all the general access nodes
>
> 2. Create a topology.conf file to make sure jobs stay within a single
> network (a sketch of what I mean follows this list).
>
> 3. Assign weights to the different partitions so that Slurm will try
> to assign jobs to them in a specific order of preference.
>
> 4. Assign weights to the different nodes, so that the nodes with the
> fastest processors are preferred.
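>
> A minimal topology.conf along the lines of step 2 might look like the
> following (the switch and node names are made up for illustration,
> with one leaf switch per physical network):
>
>     # slurm.conf
>     TopologyPlugin=topology/tree
>
>     # topology.conf
>     SwitchName=eth1g   Nodes=e[001-032]
>     SwitchName=eth10g  Nodes=d[001-064]
>     SwitchName=ib_ddr  Nodes=k[001-048]
>     SwitchName=root    Switches=eth1g,eth10g,ib_ddr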
>
> After getting responses to my questions about the topology.conf file,
> it seems like this approach may not be viable, or at least not the
> best procedure.
>
> I am now considering this:
>
> 0. Restrict access to the non-general access partitions (this is
> already done for the most part, hence step 0).
>
> 1. Assign each partition its own QOS in the slurm.conf file.
>
> 2. Assign a weight to the partitions so Slurm attempts to assign jobs
> to them in a specific order.
>
> 3. Assign weights to the nodes so the nodes are assigned in a specific
> order (faster processors first).
>
> 4. Set the job_submit plugin to all_partitions or partition (a rough
> slurm.conf sketch of steps 1-4 follows this list).
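>
> Roughly, steps 1-4 would translate into slurm.conf entries like these
> (all node lists, weights, QOS names, and priority values below are
> placeholders, and other node/partition attributes are omitted):
>
>     # Step 4: submit jobs to every partition when none is specified
>     JobSubmitPlugins=all_partitions
>
>     # Step 3: lower Weight = preferred, so the fastest nodes get the
>     # lowest weights
>     NodeName=m[001-016] Weight=10
>     NodeName=d[001-064] Weight=20
>     NodeName=e[001-032] Weight=30
>
>     # Steps 1 and 2: a QOS and a priority per partition, so Slurm
>     # considers the partitions in a preferred order
>     PartitionName=D Nodes=d[001-064] QOS=part_d PriorityTier=100
>     PartitionName=E Nodes=e[001-032] QOS=part_e PriorityTier=50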
>
>
> Step 4 in this case is the area I'm the least familiar with. One of
> the reasons we are using a job_submit.lua script is because users will
> often request partitions that are inappropriate for their jobs' needs
> (like trying to run a job that spans multiple nodes on a partition
> with only 1 GbE, or requesting partition G because it's free when
> their job only uses 1 MB of RAM). I'm also not sure I want to give up
> using job_submit.lua 100% by switching job_submit_plugin to "partition".
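>
> (My understanding, which I haven't verified, is that JobSubmitPlugins
> takes a comma-delimited list, so a line like
>
>     JobSubmitPlugins=lua,all_partitions
>
> might let me keep a slimmed-down job_submit.lua alongside one of the
> built-in plugins.)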
>
> My ultimate goal is to have users specify what resources they need
> without specifying a QOS or partition, and let Slurm handle that
> automatically based on the weights I assign to the nodes and
> partitions. I also don't want to lock a job to a specific partition
> at submit time, so that Slurm can allocate it to idle nodes in a
> different partition if that partition has idle nodes when the job is
> finally eligible to run.
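>
> In other words, the ideal submission would carry only resource
> requirements, something like
>
>     sbatch -n 64 --mem-per-cpu=2G --time=12:00:00 run_case.sh
>
> (run_case.sh being an arbitrary example script), with no --partition
> or --qos options, and Slurm would pick the partition and QOS based on
> the weights.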
>
> What is the best way to achieve my goals? All suggestions will be
> considered.
>
> For those of you who made it this far, thanks!
>
> Prentice
>
>
>