[slurm-users] Configuration recommendations for heterogeneous cluster
Prentice Bisbal
pbisbal at pppl.gov
Mon Jan 28 19:23:59 UTC 2019
This is my solution to this problem so far:
1. Create a topology file so that Slurm will not place jobs across two
different message-passing networks.
2. Create a partition called "general" for all the general-access nodes in
my environment.
3. Create a partition that is a duplicate of "general" called
"interruptible" that has a lower priority than "general" and the
existing, soon-to-be legacy partitions, so jobs submitted to this
partition will be subordinate to all other jobs.
4. I assigned weights to the different networks so that the 1 GbE nodes
are assigned first, then the 10GbE network, and then the IB network,
then the large memory nodes, and finally the nodes with large memory
and IB.
5. I further refined weights so that nodes with newer/faster processors
in each network get assigned first, and the older/slower ones last.
6. My cluster has a mix of K10, Bulldozer and Piledriver Opterons. I
added features to the node definitions for the type of processors, with
the newer processors also having the earlier architectures as features,
so if someone specifies the bulldozer constraint, it will run on
bulldozer and piledriver processors, but not K10 processors.
In my initial testing, this seems to work as desired. I have a user who
is really eager to submit pre-emptible jobs, and who is testing and
providing feedback now.
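
The numbered steps above could be sketched roughly as follows. This is only
an illustration, assuming the topology/tree plugin and partition-priority
preemption. The switch names, node names and ranges, weights, and
PriorityTier values are hypothetical placeholders, not the configuration
actually in use. Whether disconnected top-level switches fully prevent a job
from spanning networks may depend on the Slurm version, so it is worth
testing before relying on it.

```
# topology.conf (sketch; switch and node names are hypothetical)
# One top-level switch per network, with no common parent switch, so the
# tree plugin treats each network as a separate island.
SwitchName=gbe    Nodes=e[001-032]
SwitchName=tengbe Nodes=d[001-032]
SwitchName=ib     Nodes=k[001-016]

# slurm.conf (fragment; CPU and memory settings omitted, values hypothetical)
TopologyPlugin=topology/tree
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# Lower Weight is allocated first: 1 GbE, then 10 GbE, then IB, with
# newer CPUs ahead of older ones inside each network.
# Newer CPUs also advertise the older architectures as features.
NodeName=e[001-032] Weight=10 Features=piledriver,bulldozer,k10
NodeName=d[001-016] Weight=20 Features=piledriver,bulldozer,k10
NodeName=d[017-024] Weight=22 Features=bulldozer,k10
NodeName=d[025-032] Weight=24 Features=k10
NodeName=k[001-008] Weight=30 Features=ib,piledriver,bulldozer,k10
NodeName=k[009-016] Weight=32 Features=ib,bulldozer,k10

# Same nodes in both partitions; jobs in the higher PriorityTier partition
# can preempt (requeue) jobs running in the lower one.
PartitionName=general       Nodes=e[001-032],d[001-032],k[001-016] Default=YES PriorityTier=10 PreemptMode=OFF
PartitionName=interruptible Nodes=e[001-032],d[001-032],k[001-016] PriorityTier=1 PreemptMode=REQUEUE
```

With features laid out this way, a job submitted with --constraint=bulldozer
could land on the bulldozer or piledriver nodes, but not on the K10-only
nodes.
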
Stay tuned for more updates...
Prentice
On 1/23/19 5:35 PM, Cyrus Proctor wrote:
> Prentice,
>
> So, perhaps more like an a la carte menu? I could see having the
> job_submit.lua plugin block submission unless specific constraint
> classes are defined. Pair that with a QOS that a user needs to select
> and you can (almost) do away with partitions. You could have a chip class
> (amd, intel), a network class (1GbE, 10GbE, FDR), and a use class
> (serial, parallel, large-mem, etc.). Then, a user must specify at least
> 1 of each class plus a relevant QOS to submit.
>
> If you do explore SPANK plugins, you might have a look at:
> https://github.com/grondo/slurm-spank-plugins and
> https://slurm.schedmd.com/spank.html and here
> https://github.com/SchedMD/slurm/blob/master/slurm/spank.h . The last
> one (to me) actually succinctly explains more about how everything in
> Slurm is put together than I've seen pretty much anywhere else. The
> barrier to entry is decent but ultimately this is the most powerful.
>
> One other way I've explored is the use of licenses. You can concoct any
> number of interesting uses where you have a resource pool that could be
> finite (e.g. MATLAB license seats or some such), or infinite. It is
> fairly straightforward to engineer job_submit.lua to check for the
> presence of a license. Everything is architected within the SlurmDB and
> then controlled via job_submit.lua. In a small example (it's been a
> while) I extended to checkout licenses for different file systems, e.g.:
>
> $ srun -N 1 -n 1 -t 01:00:00 -p myPartition -A myAccount -L home,work \
>     --pty /bin/bash -l
>
> then job_submit.lua would pick it up:
>
> srun: --> Checking file system licenses
> srun: --> Checking out a file system license for $HOME
> srun: --> Checking out a file system license for $WORK
>
> Note that it doesn't enforce file system usage (I didn't care to
> engineer it that far) but it could block if the user did not specify at
> submission time. If for whatever reason you wanted to revoke all licenses,
> it is simply a matter of a one line modify command via sacctmgr. The
> idea for this one was that if one particular file system was on the
> fritz, revoke all the licenses for that file system. Any job that needed
> that license would stay queued while other jobs that used a different
> file system would keep humming along.
>
>
> Anyway, feel free to ping off-list too if there are other ideas that
> you'd like to spitball about.
>
> Best,
>
> Cyrus
>
> On 1/23/19 9:00 AM, Prentice Bisbal wrote:
>> Cyrus,
>>
>> Thanks for the input. Yes, I have considered features/constraints as
>> part of this, and I'm already using them for users to request IB. They
>> are definitely a key part of my strategy. I will look into Spank and
>> PriorityTiers. One of my goals is to reduce the amount of
>> scripting/customization I need to do, so if using Spank plugins
>> requires a lot of development on my part, that may be
>> counterproductive for me.
>>
>>> There are several ways to approach this and I imagine you really wish
>>> the users to be able to "just submit" with a minimum of effort and
>>> information on their part while your life is also manageable for changes
>>> or updates.
>> Not exactly. I wouldn't say I want them to 'just submit' with minimal
>> effort. I think that's a recipe for disaster - they don't specify the
>> right time limits, or correct resources, which then causes their job
>> to stay queued, prevents backfill scheduling from working, or they use
>> a node with 512 GB to run a single core job that only uses 4 GB of RAM.
>>
>> What I want is for my users to think about the *resources* they need
>> for their job, and not what partition they submit to. Right now, they
>> just think about what partition they want their job to run on, and
>> submit their job to that partition. Often, they will always use the
>> same queue for every job, regardless of the differing resource
>> requirements. While there is some logic as to why my cluster is
>> divided into the different partitions, I find most users ignore this
>> information, and just always submit to the same queue, job after job,
>> day after day, year after year. I want my users to stop thinking in
>> terms of partition names, and start thinking in terms of what
>> resources their job *really* needs. This will ultimately improve
>> cluster utilization, and reduce time spent in the queue. Some users
>> will submit a job, and as soon as it goes into the pending state, they
>> scancel it, change the partition name to a less utilized partition,
>> and resubmit it in the hopes it will start running immediately.
>>
>> Yes, there needs to be a lot of user training, and there's a lot I can
>> do to improve the environment for my users, but making the scheduler
>> more flexible needs to be one of the first steps in my vision to
>> improve things here.
>>
>> Prentice
>>
>> On 1/22/19 6:50 PM, Cyrus Proctor wrote:
>>> Hi Prentice,
>>>
>>> Have you considered Slurm features and constraints at all? You provide
>>> features (arbitrary strings in your slurm.conf) of what your hardware
>>> can provide ("amd", "ib", "FAST", "whatever"). A user then will list
>>> constraints using typical and/or/regex notation ( --constraint=amd&ib ).
>>> You may override or autofill constraint defaults yourself in your
>>> job_submit.lua.
>>>
>>> Another level: you may also create your own Slurm arguments to sbatch or
>>> srun using SPANK plugins. These could be used to simplify a constraint
>>> list in whatever way you might see fit (e.g. sbatch --fast equates to
>>> --constraint=amd&ib&FAST ).
>>>
>>> So, as a possibility, keep all nodes in one partition, supply the
>>> features in slurm.conf, have job_submit.lua give a default set of
>>> constraints (and/or force the user to provide a minimum set), create
>>> another partition that includes all the nodes as well but is
>>> preemptable/VIP/whatever (PriorityTiers work nice here too).
>>>
>>> There are several ways to approach this and I imagine you really wish
>>> the users to be able to "just submit" with a minimum of effort and
>>> information on their part while your life is also manageable for changes
>>> or updates. I find the logic of the feature/constraint system to be
>>> quite elegant for meeting complex needs of heterogeneous systems.
>>>
>>> Best,
>>>
>>> Cyrus
>>>
>>> On 1/22/19 2:49 PM, Prentice Bisbal wrote:
>>>> I left out a *very* critical detail: One of the reasons I'm looking
>>>> at revamping my Slurm configuration is that my users have requested
>>>> the capability to submit long-running, low-priority interruptible jobs
>>>> that can be killed and requeued when shorter-running, higher-priority
>>>> jobs need to use the resources.
>>>>
>>>> Prentice Bisbal
>>>> Lead Software Engineer
>>>> Princeton Plasma Physics Laboratory
>>>> http://www.pppl.gov
>>>>
>>>> On 1/22/19 3:38 PM, Prentice Bisbal wrote:
>>>>> Slurm Users,
>>>>>
>>>>> I would like your input on the best way to configure Slurm for a
>>>>> heterogeneous cluster I am responsible for. This e-mail will probably
>>>>> be a bit long to include all the necessary details of my environment
>>>>> so thanks in advance to those of you who read all of it!
>>>>>
>>>>> The cluster I support is a very heterogeneous cluster with several
>>>>> different network technologies and generations of processors.
>>>>> Although some people here refer to this cluster as numerous
>>>>> different clusters, in reality it is one cluster, since all the nodes
>>>>> have their work assigned to them from a single Slurm Controller, all
>>>>> the nodes use the same executables installed on a shared drive, and
>>>>> all nodes are diskless and use the same NFSroot OS image, so they are
>>>>> all configured 100% alike.
>>>>>
>>>>> The cluster has been built piece-meal over a number of years, which
>>>>> explains the variety of hardware/networking in use. In Slurm, each of
>>>>> the different "clusters" is a separate partition intended to serve
>>>>> different purposes:
>>>>>
>>>>> Partition "E" - AMD Opteron 6320 processors, 64 GB RAM/node, 1 GbE,
>>>>> meant for serial, and low task count parallel jobs that only use a
>>>>> few cores and stay within a single node. Limited to 16 tasks or less
>>>>> in QOS
>>>>>
>>>>> Partition "D" - AMD Opteron 6136, 6274, and 6376 processors, 32 GB or
>>>>> 64 GB RAM per node, 10 GbE, meant for general-purpose parallel jobs
>>>>> spanning multiple nodes. Min. Task count of 32 tasks to prevent
>>>>> smaller jobs that should be run on Partition E from running here.
>>>>>
>>>>> Partition "K" - AMD Opteron 6274 and 6376 processors, 64 GB RAM per
>>>>> node, DDR IB network, meant for tightly-coupled parallel jobs
>>>>>
>>>>> Partition "G1" - AMD Opteron 6274, 6276, 6376, and Intel Xeon E5-2698
>>>>> v3 & E5-2630 v3 processors, RAM ranging from 128 GB - 512 GB per
>>>>> node, 1 GbE Network, meant for "large memory" jobs - some nodes are
>>>>> in different racks attached to different switches, so not really
>>>>> optimal for multi-node jobs.
>>>>>
>>>>> Partition "J" - AMD Opteron 6136 Processors, 280 GB RAM per node,
>>>>> DDR IB, was originally meant for a specific project, I now need to
>>>>> allow general access to it.
>>>>>
>>>>> Partition "G2" - AMD Opteron 6136, 6274, and 6320 processors, 32 GB,
>>>>> 96 GB, and 128 GB RAM per node, IB network, access is restricted to
>>>>> specific users/projects.
>>>>>
>>>>> Partition "M" - Intel Xeon E5-2698 v3 and E5-2697A v4 processors, 128
>>>>> GB RAM per node, 1 GbE network, reserved for running 1 specific
>>>>> simulation application.
>>>>>
>>>>> To make all this work so far, I have created a job_submit.lua script
>>>>> with numerous checks and conditionals that has become quite unwieldy.
>>>>> As a result, changes that should be simple take a considerable amount
>>>>> of time for me to rewrite and test the script. On top of that, almost
>>>>> all of the logic in that script is logic that Slurm can already
>>>>> perform in a more easily manageable way. I've essentially re-invented
>>>>> wheels that Slurm already provides.
>>>>>
>>>>> Further, each partition has its own QOS, so my job_submit.lua
>>>>> assigns each job to a specific partition and QOS depending on its
>>>>> resource requirements. This means that a job that is assigned to D,
>>>>> but could also run on K if K is idle, will never be able to run on
>>>>> K. This means cluster nodes could go unutilized, reducing cluster
>>>>> utilization stats (which management looks at), and increasing job
>>>>> queue time (which users are obsessed with).
>>>>>
>>>>> I would like to simplify this configuration as much as possible to
>>>>> reduce the labor it takes me to maintain my job_submit.lua script,
>>>>> and therefore make me more responsive to meeting my users' needs, and
>>>>> increase cluster utilization. Since I have numerous different
>>>>> networks, I was thinking that I could use the topology.conf file to
>>>>> keep jobs on a single network, and prevent multi-node jobs from running
>>>>> on partition E. The partitions reserved for specific
>>>>> projects/departments would still need to be requested explicitly.
>>>>>
>>>>> At first, I was going to take this approach:
>>>>>
>>>>> 1. Create a single partition with all the general access nodes
>>>>>
>>>>> 2. Create a topology.conf file to make sure jobs stay within a single
>>>>> network.
>>>>>
>>>>> 3. Assign weights to the different partitions so that Slurm will try
>>>>> to assign jobs to them in a specific order of preference
>>>>>
>>>>> 4. Assign weights to the different nodes, so that the nodes with the
>>>>> fastest processors are preferred.
>>>>>
>>>>> After getting responses to my questions about the topology.conf file,
>>>>> it seems like this approach may not be viable, or at least not the
>>>>> best procedure.
>>>>>
>>>>> I am now considering this:
>>>>>
>>>>> 0. Restrict access to the non-general access partitions (this is
>>>>> already done for the most part, hence step 0).
>>>>>
>>>>> 1. Assign each Partition its own QOS in the slurm.conf file.
>>>>>
>>>>> 2. Assign a weight to the partitions so Slurm attempts to assign jobs
>>>>> to them in a specific order.
>>>>>
>>>>> 3. Assign weights to the nodes so the nodes are assigned in a
>>>>> specific order (faster processors first)
>>>>>
>>>>> 4. Set job_submit plugin to all_partitions, or partition
>>>>>
>>>>>
>>>>> Step 4 in this case is the area I'm the least familiar with. One of
>>>>> the reasons we are using a job_submit.lua script is because users
>>>>> will often request partitions that are inappropriate for their job
>>>>> needs (like trying to run a job that spans multiple nodes on a
>>>>> partition with only 1 GbE, or request partition G because it's free,
>>>>> but their job only uses 1 MB of RAM). I'm also not sure if I want to
>>>>> give up using job_submit.lua 100% by switching job_submit_plugin to
>>>>> "partition"
>>>>>
>>>>> My ultimate goal is to have users specify what resources they need
>>>>> without specifying a QOS or Partition, and let Slurm handle that
>>>>> automatically based on the weights I assign to the nodes and
>>>>> partitions. I also don't want to lock a job to a specific partition
>>>>> at submit time so Slurm can allocate it to idle nodes in a different
>>>>> partition if that partition has idle nodes when the job is finally
>>>>> eligible to run.
>>>>>
>>>>> What is the best way to achieve my goals? All suggestions will be
>>>>> considered.
>>>>>
>>>>> For those of you who made it this far, thanks!
>>>>>
>>>>> Prentice
>>>>>
>>>>>
>>>>>