[slurm-users] Maintaining slurm config files for test and production clusters
Brian Andrus
toomuchit at gmail.com
Tue Jan 17 22:54:29 UTC 2023
Run a secondary controller.
Do 'scontrol takeover' before any changes, make your changes, and then
restart slurmctld on the primary.
If it fails, no harm/no foul, because the secondary is still running
happily. If it succeeds, it takes control back and you can then restart
the secondary with the new (known good) config.
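
A minimal sketch of that workflow, assuming systemd-managed slurmctld on
both controllers and the backup listed as the second SlurmctldHost in
slurm.conf:

    # force the backup controller to take over before touching anything
    scontrol takeover

    # edit slurm.conf, then restart the primary so it reads the new file
    systemctl restart slurmctld      # run on the primary controller

    # check which controller is in charge; if the primary came back up
    # cleanly, it has resumed control
    scontrol ping

    # finally, copy the known-good config to the backup and restart it
    systemctl restart slurmctld      # run on the backup controller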
Brian Andrus
On 1/17/2023 12:36 PM, Groner, Rob wrote:
> So, you have two equal-sized clusters, one for test and one for
> production? Our test cluster is a small handful of machines compared
> to our production cluster.
>
> We have a test slurm control node on a test cluster with a test
> slurmdbd host and test nodes, all named specifically for test. We
> don't want a situation where our "test" slurm controller node is named
> the same as our "prod" slurm controller node, because the possibility
> of mistake is too great. ("I THOUGHT I was on the test network....")
>
> Here's the ultimate question I'm trying to get answered.... Does
> anyone update their slurm.conf file on production outside of an
> outage? If so, how do you KNOW the slurmctld won't barf on some
> problem in the file you didn't see (even a mistaken character in there
> would do it)? We're trying to move to a model where we don't have
> downtimes as often, so I need to determine a reliable way to continue
> to add features to slurm without having to wait for the next outage.
> There's no way I know of to prove the slurm.conf file is good, except
> by feeding it to slurmctld and crossing my fingers.
>
> Rob
>
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf
> of Fulcomer, Samuel <samuel_fulcomer at brown.edu>
> *Sent:* Wednesday, January 4, 2023 1:54 PM
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] Maintaining slurm config files for test
> and production clusters
>
>
> Just make the cluster names the same, with different NodeName and
> PartitionName lines. The rest of slurm.conf can be the same. Having two
> cluster names is only necessary if you're running production in a
> multi-cluster configuration.
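>
> A hedged sketch of what that could look like (hostnames, node names,
> and sizes below are invented for illustration):
>
>     # shared portion of slurm.conf, identical on both clusters
>     ClusterName=ourcluster
>     SchedulerType=sched/backfill
>     SelectType=select/cons_tres
>
>     # NodeName/PartitionName lines on the production cluster
>     NodeName=prod-n[001-100] CPUs=48 RealMemory=192000 State=UNKNOWN
>     PartitionName=batch Nodes=prod-n[001-100] Default=YES State=UP
>
>     # NodeName/PartitionName lines on the test cluster
>     NodeName=test-n[01-04] CPUs=8 RealMemory=32000 State=UNKNOWN
>     PartitionName=batch Nodes=test-n[01-04] Default=YES State=UP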
>
> Our model has been to have a production cluster and a test cluster
> which becomes the production cluster at yearly upgrade time (for us,
> next week). The test cluster is also used for rebuilding MPI prior to
> the upgrade, when the PMI changes. We force users to resubmit jobs at
> upgrade time (after the maintenance reservation) to ensure that MPI
> runs correctly.
>
>
>
> On Wed, Jan 4, 2023 at 12:26 PM Groner, Rob <rug262 at psu.edu> wrote:
>
> We currently have a test cluster and a production cluster, all on
> the same network. We try things on the test cluster, and then we
> gather those changes and make a change to the production cluster.
> We're doing that through two different repos, but we'd like to
> have a single repo to make the transition from testing configs to
> publishing them more seamless. The problem is, of course, that
> the test cluster and production clusters have different cluster
> names, as well as different nodes within them.
>
>     Using the Include directive, I can pull all of the NodeName lines
>     out of slurm.conf and put them into %c-nodes.conf files, one for
>     production, one for test (a sketch of that layout follows the list
>     below). That still leaves me with two problems:
>
>     * The cluster name itself will still be a problem. I WANT the
>       same slurm.conf file between test and production...but the
>       ClusterName line will be different for the two of them. Can I
>       use an environment variable for the cluster name, so production
>       and test could each supply a different value?
>     * The gres.conf file. I tried using the same "Include" trick
>       that works on slurm.conf, but it failed because it did not
>       know what the "ClusterName" was. I think that means either
>       that it doesn't work for anything other than slurm.conf, or
>       that the ClusterName will have to be defined in gres.conf as well?
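>
>     To make that Include layout concrete, a rough sketch (file and node
>     names invented; this assumes a Slurm version whose Include directive
>     expands %c to the cluster name), with ClusterName still being the
>     one line that has to differ:
>
>         # slurm.conf, identical except for the ClusterName line
>         ClusterName=prod             # ClusterName=test on the test cluster
>         Include %c-nodes.conf        # -> prod-nodes.conf or test-nodes.conf
>
>         # prod-nodes.conf
>         NodeName=prod-n[001-100] CPUs=48 RealMemory=192000 State=UNKNOWN
>
>         # test-nodes.conf
>         NodeName=test-n[01-04] CPUs=8 RealMemory=32000 State=UNKNOWN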
>
> Any other suggestions of how to keep our slurm files in a single
> source control repo, but still have the flexibility to have them
> run elegantly on either test or production systems?
>
> Thanks.
>