[slurm-users] Configuration recommendations for heterogeneous cluster

Tue Jan 22 20:38:07 UTC 2019

Slurm Users,

I would like your input on the best way to configure Slurm for a
heterogeneous cluster I am responsible for. This e-mail is a bit long
because it includes all the necessary details of my environment, so
thanks in advance to those of you who read all of it!

The cluster I support is very heterogeneous, with several different
network technologies and generations of processors. Although some
people here refer to it as numerous different clusters, in reality it
is one cluster: all the nodes have their work assigned to them by a
single Slurm controller, all the nodes use the same executables
installed on a shared drive, and all the nodes are diskless and use the
same NFSroot OS image, so they are all configured 100% alike.

The cluster has been built piece-meal over a number of years, which 
explains the variety of hardware/networking in use. In Slurm, each of 
the different "clusters" is a separate partition intended to serve 
different purposes:

Partition "E" - AMD Opteron 6320 processors, 64 GB RAM/node, 1 GbE,
meant for serial jobs and low-task-count parallel jobs that use only a
few cores and stay within a single node. Limited to 16 tasks or fewer
by its QOS.

Partition "D" - AMD Opteron 6136, 6274, and 6376 processors, 32 GB or 64
GB RAM per node, 10 GbE, meant for general-purpose parallel jobs
spanning multiple nodes. Minimum task count of 32 to keep smaller jobs
that should run on Partition E from running here.

Partition "K"  - AMD Opteron 6274 and 6376 processors, 64 GB RAM per 
node, DDR IB network, meant for tightly-coupled parallel jobs

Partition "G1" - AMD Opteron 6274, 6276, 6376, and Intel Xeon E5-2698 v3 
&  E5-2630 v3 processors, RAM ranging from 128 GB - 512 GB per node, 1 
GbE Network, meant for "large memory" jobs - some nodes are in different 
racks attached to different switches, so not really optimal for 
multi-node jobs.

Partition "J" - AMD Opteron 6136 processors, 280 GB RAM per node, DDR
IB. It was originally meant for a specific project, but I now need to
allow general access to it.

Partition "G2" - AMD Opteron 6136, 6274, and 6320 processors, 32 GB, 96
GB, and 128 GB RAM per node, IB network; access is restricted to
specific users/projects.

Partition "M" - Intel Xeon E5-2698 v3 and E5-2697A v4 processors,  128 
GB RAM per node, 1 GbE network, reserved for running 1 specific 
simulation application.

To make all this work so far, I have created a job_submit.lua script 
with numerous checks and conditionals that has become quite unwieldy. As 
a result, changes that should be simple take a considerable amount of 
time for me to rewrite and test the script. On top of that, almost all 
of the logic in that script is logic that Slurm can already perform in a 
more easily manageable way. I've essentially re-invented wheels that 
Slurm already provides.

Further, each partition has its own QOS, so my job_submit.lua assigns
each job to a specific partition and QOS depending on its resource
requirements. This means that a job assigned to D, which could also run
on K if K is idle, will never be able to run on K. As a result, cluster
nodes can sit idle, which reduces the cluster utilization statistics
(which management looks at) and increases job queue time (which users
are obsessed with).

I would like to simplify this configuration as much as possible to
reduce the labor it takes me to maintain my job_submit.lua script, which
would make me more responsive to my users' needs and increase cluster
utilization. Since I have numerous different networks, I was thinking
that I could use the topology.conf file to keep jobs on a single
network and prevent multi-node jobs from running on partition E. The
partitions reserved for specific projects/departments would still need
to be requested explicitly.

At first, I was going to take this approach:

1. Create a single partition with all the general access nodes

2. Create a topology.conf file to make sure jobs stay within a single 
network.

3. Assign weights to the different partitions so that Slurm will try to
assign jobs to them in a specific order of preference

4. Assign weights to the different nodes, so that the nodes with the 
fastest processors are preferred.
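
For illustration, steps 1 and 2 might look roughly like the fragments
below. This is an untested sketch: the partition name, switch names,
node names, and node ranges are all placeholders I made up, and I am
assuming the topology/tree plugin with one leaf switch per network and
no common parent switch.

```
# slurm.conf fragment (sketch only; all names are hypothetical)
TopologyPlugin=topology/tree
PartitionName=general Nodes=k[001-032],d[001-064],e[001-032] Default=YES

# topology.conf fragment: one switch entry per physical network,
# deliberately with no higher-level switch joining them
SwitchName=ib_k   Nodes=k[001-032]
SwitchName=xgbe_d Nodes=d[001-064]
SwitchName=gbe_e  Nodes=e[001-032]
```

Whether Slurm would actually refuse to spread one job across switches
that share no parent is the part I was unsure about.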

After getting responses to my questions about the topology.conf file,
it seems this approach may not be viable, or at least not the best
practice.

I am now considering this:

0. Restrict access to the non-general access partitions (this is already 
done for the most part, hence step 0).

1. Assign each partition its own QOS in the slurm.conf file.

2. Assign a weight to the partitions so Slurm attempts to assign jobs to 
them in a specific order.

3. Assign weights to the nodes so the nodes are assigned in a specific 
order (faster processors first)

4. Set the job_submit plugin to all_partitions or partition.
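
To make these steps concrete, here is a rough, untested sketch of the
slurm.conf lines I have in mind. All node names, node ranges, QOS names,
the group name, and the numeric values are placeholders I made up for
illustration, not my real configuration. As far as I can tell,
partitions have no Weight option of their own, so I am assuming
PriorityTier is the closest equivalent for step 2.

```
# slurm.conf fragment (sketch only; names and numbers are hypothetical)

# Step 3: node weights. Slurm allocates the lowest-weight nodes first,
# so the nodes with the faster processors get the smaller numbers.
NodeName=k[001-016] Weight=10
NodeName=k[017-032] Weight=20
NodeName=d[001-064] Weight=30
NodeName=e[001-032] Weight=40

# Steps 1 and 2: one QOS per partition plus a PriorityTier per partition.
# The QOS names must already exist in the accounting database.
# MaxNodes=1 keeps multi-node jobs off the 1 GbE partition.
PartitionName=E Nodes=e[001-032] QOS=part_e PriorityTier=30 MaxNodes=1
PartitionName=D Nodes=d[001-064] QOS=part_d PriorityTier=20
PartitionName=K Nodes=k[001-032] QOS=part_k PriorityTier=10

# Step 0: restricted partitions stay restricted (group name is made up).
PartitionName=G2 Nodes=g2[001-016] QOS=part_g2 AllowGroups=g2users

# Step 4: jobs submitted without a partition get all partitions as
# their default.
JobSubmitPlugins=all_partitions
```

I have not verified that PriorityTier orders the partitions for a single
job the way I want, or how all_partitions interacts with the restricted
partitions.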

Step 4 in this case is the area I'm the least familiar with. One of the
reasons we are using a job_submit.lua script is that users will often
request partitions that are inappropriate for their job needs (like
trying to run a job that spans multiple nodes on a partition with only 1
GbE, or requesting partition G because it's free when their job only
uses 1 MB of RAM). I'm also not sure I want to give up using
job_submit.lua 100% by switching JobSubmitPlugins to "partition".

My ultimate goal is to have users specify what resources they need
without specifying a QOS or partition, and let Slurm handle that
automatically based on the weights I assign to the nodes and
partitions. I also don't want to lock a job to a specific partition at
submit time, so that Slurm can allocate it to idle nodes in a different
partition if that partition has idle nodes when the job is finally
eligible to run.

What is the best way to achieve my goals? All suggestions will be 
considered.

For those of you who made it this far, thanks!

Prentice

-- 
Prentice