[slurm-users] Maintaining slurm config files for test and production clusters

Tue Jan 17 22:54:29 UTC 2023

Run a secondary controller.

Run 'scontrol takeover' before making any changes, so the secondary takes 
control. Then make your changes and restart slurmctld on the primary.

If the restart fails, no harm, no foul, because the secondary is still 
running happily. If it succeeds, the primary takes control back and you 
can then restart the secondary with the new (known good) config.
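
As a rough sketch of that sequence (an illustration under assumptions, not a 
tested tool): the host names ctl-primary and ctl-backup are hypothetical, 
and I am assuming slurmctld runs as a systemd unit named "slurmctld" that 
you can restart over ssh. 'scontrol takeover' and 'scontrol ping' are real 
scontrol subcommands; everything else here is a placeholder to adapt. By 
default it only lists the steps and executes nothing.

```python
import shlex
import subprocess


def rollout_steps(primary, backup, service="slurmctld"):
    """Ordered steps for the takeover-based config change.

    Each step is (description, command). A command of None marks a manual
    step that the operator performs by hand.
    """
    return [
        ("secondary takes control", ["scontrol", "takeover"]),
        ("edit and deploy the new slurm.conf", None),
        ("restart slurmctld on the primary",
         ["ssh", primary, "systemctl", "restart", service]),
        ("show controller status", ["scontrol", "ping"]),
        ("confirm by eye that the primary is back in control", None),
        ("restart the secondary with the known good config",
         ["ssh", backup, "systemctl", "restart", service]),
    ]


def run(steps, dry_run=True):
    """Return a log of the steps; execute them only when dry_run is False."""
    log = []
    for description, command in steps:
        if command is None:
            log.append("MANUAL: " + description)
            if not dry_run:
                # Pause so the operator can do the manual step.
                input(description + "; press Enter when done: ")
            continue
        log.append(shlex.join(command))
        if not dry_run:
            # check=True stops the rollout at the first failing command.
            subprocess.run(command, check=True)
    return log


if __name__ == "__main__":
    for line in run(rollout_steps("ctl-primary", "ctl-backup")):
        print(line)
```

The manual confirmation before the last step is deliberate: a successful 
"systemctl restart" does not by itself prove that slurmctld accepted the 
new config, so the secondary is only restarted after someone has looked at 
the controller status.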

Brian Andrus

On 1/17/2023 12:36 PM, Groner, Rob wrote:
> So, you have two equal sized clusters, one for test and one for 
> production?  Our test cluster is a small handful of machines compared 
> to our production.
>
> We have a test slurm control node on a test cluster with a test 
> slurmdbd host and test nodes, all named specifically for test.  We 
> don't want a situation where our "test" slurm controller node is named 
> the same as our "prod" slurm controller node, because the possibility 
> of mistake is too great.  ("I THOUGHT I was on the test network....")
>
> Here's the ultimate question I'm trying to get answered....  Does 
> anyone update their slurm.conf file on production outside of an 
> outage?  If so, how do you KNOW the slurmctld won't barf on some 
> problem in the file you didn't see (even a mistaken character in there 
> would do it)? We're trying to move to a model where we don't have 
> downtimes as often, so I need to determine a reliable way to continue 
> to add features to slurm without having to wait for the next outage.  
> There's no way I know of to prove the slurm.conf file is good, except 
> by feeding it to slurmctld and crossing my fingers.
>
> Rob
>
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf 
> of Fulcomer, Samuel <samuel_fulcomer at brown.edu>
> *Sent:* Wednesday, January 4, 2023 1:54 PM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] Maintaining slurm config files for test 
> and production clusters
>
> Just make the cluster names the same, with different NodeName and 
> PartitionName lines. The rest of slurm.conf can be the same. Having two 
> cluster names is only necessary if you're running production in a 
> multi-cluster configuration.
>
> Our model has been to have a production cluster and a test cluster 
> which becomes the production cluster at yearly upgrade time (for us, 
> next week). The test cluster is also used for rebuilding MPI prior to 
> the upgrade, when the PMI changes. We force users to resubmit jobs at 
> upgrade time (after the maintenance reservation) to ensure that MPI 
> runs correctly.
>
>
>
> On Wed, Jan 4, 2023 at 12:26 PM Groner, Rob <rug262 at psu.edu> wrote:
>
>     We currently have a test cluster and a production cluster, all on
>     the same network.  We try things on the test cluster, and then we
>     gather those changes and make a change to the production cluster. 
>     We're doing that through two different repos, but we'd like to
>     have a single repo to make the transition from testing configs to
>     publishing them more seamless.  The problem is, of course, that
>     the test cluster and production clusters have different cluster
>     names, as well as different nodes within them.
>
>     Using the include directive, I can pull all of the NodeName lines
>     out of slurm.conf and put them into %c-nodes.conf files, one for
>     production, one for test.  That still leaves me with two problems:
>
>       * The clustername itself will still be a problem.  I WANT the
>         same slurm.conf file between test and production...but the
>         clustername line will be different for them both.  Can I use
>         an env var in that cluster name, because on production there
>         could be a different env var value than on test?
>       * The gres.conf file.  I tried using the same "include" trick
>         that works on slurm.conf, but it failed because it did not
>         know what the "ClusterName" was.  I think that means that
>         either it doesn't work for anything other than slurm.conf, or
>         that the clustername will have to be defined in gres.conf as well?
>
>     Any other suggestions of how to keep our slurm files in a single
>     source control repo, but still have the flexibility to have them
>     run elegantly on either test or production systems?
>
>     Thanks.
>