<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p><br>
    </p>
    <p>Run a secondary (backup) controller.</p>
    <p>Run 'scontrol takeover' before making any changes, then make your
      changes and restart slurmctld on the primary.</p>
    <p>If the primary fails to start, no harm, no foul, because the
      secondary is still running happily. If it starts successfully, it
      takes control back, and you can then restart the secondary with
      the new (known good) config.</p>
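    <p>A minimal sketch of that sequence in Python, not a tested
      procedure. It assumes it runs on the primary controller host, that
      slurmctld is managed as a systemd unit, and that each command
      returns a nonzero exit status on failure (worth checking on your
      Slurm version). The function and runner names are made up for
      illustration.</p>
<pre>
```python
import subprocess
from typing import Callable, Sequence

# A runner executes one command and returns its exit status.
Runner = Callable[[Sequence[str]], int]


def real_runner(cmd: Sequence[str]) -> int:
    """Run a command locally and return its exit status."""
    return subprocess.run(list(cmd), check=False).returncode


def restart_primary_with_fallback(run: Runner) -> str:
    """Hand control to the backup, restart the primary, report the outcome.

    Assumption: a backup slurmctld is already configured and running.
    """
    # Ask the backup controller to take over before touching the primary.
    if run(["scontrol", "takeover"]) != 0:
        return "aborted: backup did not take over"
    # (Edit or deploy the new slurm.conf at this point.)
    # Start the primary with the new configuration.
    if run(["systemctl", "restart", "slurmctld"]) != 0:
        return "primary failed to start: backup still in control"
    # Check that the controllers respond (exit status is an assumption).
    if run(["scontrol", "ping"]) != 0:
        return "primary restarted but ping failed: investigate"
    return "primary restarted with new config"
```
</pre>
    <p>Passing real_runner runs the commands for real; passing a stub
      runner lets you rehearse the logic without touching a cluster.</p>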
    <p><br>
    </p>
    <p>Brian Andrus<br>
    </p>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 1/17/2023 12:36 PM, Groner, Rob
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:BL0PR02MB44994063CBFAA2D771500B1E80C69@BL0PR02MB4499.namprd02.prod.outlook.com">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <style type="text/css" style="display:none;">P {margin-top:0;margin-bottom:0;}</style>
      <div class="elementToProof"><span style="font-family: Calibri,
          Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0,
          0, 0); background-color: rgb(255, 255, 255);"
          class="elementToProof">So, you have two equal-sized clusters,
          one for test and one for production?  Our test cluster is a
          small handful of machines compared to our production cluster.</span></div>
      <div class="elementToProof"><span style="font-family: Calibri,
          Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0,
          0, 0); background-color: rgb(255, 255, 255);"
          class="elementToProof"><br>
        </span></div>
      <div class="elementToProof"><span style="font-family: Calibri,
          Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0,
          0, 0); background-color: rgb(255, 255, 255);"
          class="elementToProof">We have a test Slurm control node on a
          test cluster with a test slurmdbd host and test nodes, all
          named specifically for test.  We don't want a situation where
          our "test" Slurm controller node has the same name as our
          "prod" Slurm controller node, because the risk of a
          mistake is too great.  ("I THOUGHT I was on the test
          network....")</span></div>
      <div class="elementToProof"><span style="font-family: Calibri,
          Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0,
          0, 0); background-color: rgb(255, 255, 255);"
          class="elementToProof"><br>
        </span></div>
      <div class="elementToProof"><span style="font-family: Calibri,
          Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0,
          0, 0); background-color: rgb(255, 255, 255);"
          class="elementToProof">Here's the ultimate question I'm trying
          to get answered: does anyone update their slurm.conf file
          on production outside of an outage?  If so, how do you KNOW
          that slurmctld won't barf on some problem in the file that you
          didn't see (even a single mistaken character would do it)? 
          We're trying to move to a model with less frequent downtimes,
          so I need a reliable way to keep adding features to Slurm
          without having to wait for the next outage.  I know of no way
          to prove the slurm.conf file is good, except by feeding it to
          slurmctld and crossing my fingers.</span></div>
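      <div>One partial answer to the "mistaken character" worry is a
        pre-flight lint run before the file ever reaches slurmctld. The
        sketch below is only a rough syntactic check written for
        illustration: the function name is invented, it is deliberately
        stricter than Slurm's own parser, and it ignores quoted values
        containing spaces, line continuations, and escaped '#'
        characters. Passing it does not prove slurmctld will accept the
        file.</div>
<pre>
```python
import re
from typing import List

# A token looks like Key=Value; keys are letters, digits and underscores.
_TOKEN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*=\S+$")


def lint_slurm_conf(text: str) -> List[str]:
    """Return a list of human-readable problems found in slurm.conf text."""
    problems = []
    cluster_names = 0
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        tokens = line.split()
        # Include lines take a single path instead of Key=Value pairs.
        if tokens[0].lower() == "include":
            if len(tokens) != 2:
                problems.append(f"line {lineno}: Include needs exactly one path")
            continue
        for tok in tokens:
            if not _TOKEN.match(tok):
                problems.append(f"line {lineno}: bad token {tok!r}")
        if tokens[0].lower().startswith("clustername="):
            cluster_names += 1
    if cluster_names != 1:
        problems.append(f"expected exactly one ClusterName, found {cluster_names}")
    return problems
```
</pre>
      <div>An empty list means only that nothing obviously malformed
        was found, so it complements a backup controller rather than
        replacing one.</div>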
      <div class="elementToProof"><span style="font-family: Calibri,
          Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0,
          0, 0); background-color: rgb(255, 255, 255);"
          class="elementToProof"><br>
        </span></div>
      <div class="elementToProof"><span style="font-family: Calibri,
          Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0,
          0, 0); background-color: rgb(255, 255, 255);"
          class="elementToProof">Rob</span></div>
      <div style="font-family: Calibri, Arial, Helvetica, sans-serif;
        font-size: 12pt; color: rgb(0, 0, 0);">
        <br>
      </div>
      <hr tabindex="-1" style="display:inline-block; width:98%">
      <div id="divRplyFwdMsg" dir="ltr"><font style="font-size: 11pt;"
          face="Calibri, sans-serif" color="#000000"><b>From:</b>
          slurm-users <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users-bounces@lists.schedmd.com">&lt;slurm-users-bounces@lists.schedmd.com&gt;</a> on
          behalf of Fulcomer, Samuel <a class="moz-txt-link-rfc2396E" href="mailto:samuel_fulcomer@brown.edu">&lt;samuel_fulcomer@brown.edu&gt;</a><br>
          <b>Sent:</b> Wednesday, January 4, 2023 1:54 PM<br>
          <b>To:</b> Slurm User Community List
          <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com">&lt;slurm-users@lists.schedmd.com&gt;</a><br>
          <b>Subject:</b> Re: [slurm-users] Maintaining slurm config
          files for test and production clusters</font>
        <div> </div>
      </div>
      <div>
        <div>
          <div dir="ltr">Just make the cluster names the same, with
            different NodeName and PartitionName lines. The rest of
            slurm.conf can be the same. Having two cluster names is only
            necessary if you're running production in a multi-cluster
            configuration.
            <div><br>
            </div>
            <div>Our model has been to have a production cluster and a
              test cluster which becomes the production cluster at
              yearly upgrade time (for us, next week). The test cluster
              is also used for rebuilding MPI prior to the upgrade, when
              the PMI changes. We force users to resubmit jobs at
              upgrade time (after the maintenance reservation) to ensure
              that MPI runs correctly.</div>
            <div><br>
            </div>
            <div><br>
            </div>
          </div>
          <br>
          <div class="x_gmail_quote">
            <div dir="ltr" class="x_gmail_attr">On Wed, Jan 4, 2023 at
              12:26 PM Groner, Rob &lt;<a href="mailto:rug262@psu.edu"
                data-auth="NotApplicable" data-loopstyle="link"
                moz-do-not-send="true" class="moz-txt-link-freetext">rug262@psu.edu</a>&gt;
              wrote:<br>
            </div>
            <blockquote class="x_gmail_quote" style="margin:0px 0px 0px
              0.8ex; border-left:1px solid rgb(204,204,204);
              padding-left:1ex">
              <div class="x_msg-7556422008998512349">
                <div dir="ltr">
                  <div><span style="font-family: Calibri, Arial,
                      Helvetica, sans-serif; font-size: 12pt; color:
                      rgb(0, 0, 0); background-color: rgb(255, 255,
                      255);">We currently have a test cluster and a
                      production cluster, both on the same network.  We
                      try things on the test cluster, and then we gather
                      those changes and apply them to the production
                      cluster.  We're doing that through two different
                      repos, but we'd like to have a single repo to make
                      the transition from testing configs to publishing
                      them more seamless.  The problem is, of course,
                      that the test and production clusters have
                      different cluster names, as well as different
                      nodes within them.</span></div>
                  <div><span style="font-family: Calibri, Arial,
                      Helvetica, sans-serif; font-size: 12pt; color:
                      rgb(0, 0, 0); background-color: rgb(255, 255,
                      255);"><br>
                    </span></div>
                  <div><span style="font-family: Calibri, Arial,
                      Helvetica, sans-serif; font-size: 12pt; color:
                      rgb(0, 0, 0); background-color: rgb(255, 255,
                      255);">Using the include directive, I can pull all
                      of the NodeName lines out of slurm.conf and put
                      them into %c-nodes.conf files, one for production,
                      one for test.  That still leaves me with two
                      problems:</span></div>
                  <div>
                    <ul>
                      <li style="font-size: 12pt; font-family: Calibri,
                        Arial, Helvetica, sans-serif; color: rgb(0, 0,
                        0); background-color: rgb(255, 255, 255);">
                        <span style="font-family: Calibri, Arial,
                          Helvetica, sans-serif; font-size: 12pt; color:
                          rgb(0, 0, 0); background-color: rgb(255, 255,
                          255);">The ClusterName itself will still be a
                          problem.  I WANT the same slurm.conf file
                          on test and production...but the
                          ClusterName line will differ between
                          them.  Can I use an env var in that cluster
                          name, so that production can have a
                          different env var value than test?</span></li>
                      <li style="font-size: 12pt; font-family: Calibri,
                        Arial, Helvetica, sans-serif; color: rgb(0, 0,
                        0); background-color: rgb(255, 255, 255);">
                        <span style="font-family: Calibri, Arial,
                          Helvetica, sans-serif; font-size: 12pt; color:
                          rgb(0, 0, 0); background-color: rgb(255, 255,
                          255);">The gres.conf file.  I tried using the
                          same "include" trick that works on slurm.conf,
                          but it failed because it did not know what the
                          "ClusterName" was.  I think that means that
                          either it doesn't work for anything other than
                          slurm.conf, or that the clustername will have
                          to be defined in gres.conf as well?</span></li>
                    </ul>
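                    <div>One way to sidestep the ClusterName problem,
                      sketched here as an assumption rather than
                      anything Slurm provides, is to keep a single
                      template in the repo and render slurm.conf at
                      deploy time. The template contents, the function
                      name, and the host names in the usage note are
                      hypothetical; the Include line keeps the %c
                      expansion so Slurm still picks the per-cluster
                      node file itself.</div>
<pre>
```python
from string import Template

# Shared template kept in the single repository. $cluster and $controller
# are filled in per site; %c is left alone for Slurm to expand.
SLURM_CONF_TEMPLATE = Template(
    "ClusterName=$cluster\n"
    "SlurmctldHost=$controller\n"
    "Include %c-nodes.conf\n"
)


def render_slurm_conf(cluster: str, controller: str) -> str:
    """Return slurm.conf text for one cluster from the shared template."""
    # substitute() raises KeyError if a placeholder is left unfilled,
    # so a typo in the template fails at render time, not in slurmctld.
    return SLURM_CONF_TEMPLATE.substitute(cluster=cluster, controller=controller)
```
</pre>
                    <div>For example, render_slurm_conf("test",
                      "testctl01") yields the test cluster's file, and
                      the cluster name could just as well come from an
                      environment variable read by the deploy script.</div>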
                    <div><span style="font-family: Calibri, Arial,
                        Helvetica, sans-serif; font-size: 12pt; color:
                        rgb(0, 0, 0); background-color: rgb(255, 255,
                        255);">Any other suggestions of how to keep our
                        slurm files in a single source control repo, but
                        still have the flexibility to have them run
                        elegantly on either test or production systems?</span></div>
                    <div><span style="font-family: Calibri, Arial,
                        Helvetica, sans-serif; font-size: 12pt; color:
                        rgb(0, 0, 0); background-color: rgb(255, 255,
                        255);"><br>
                      </span></div>
                    <div><span style="font-family: Calibri, Arial,
                        Helvetica, sans-serif; font-size: 12pt; color:
                        rgb(0, 0, 0); background-color: rgb(255, 255,
                        255);">Thanks.</span></div>
                    <div><span style="font-family: Calibri, Arial,
                        Helvetica, sans-serif; font-size: 12pt; color:
                        rgb(0, 0, 0); background-color: rgb(255, 255,
                        255);"><br>
                      </span></div>
                  </div>
                </div>
              </div>
            </blockquote>
          </div>
        </div>
      </div>
    </blockquote>
  </body>
</html>