<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p><br>
</p>
<p>Run a secondary (backup) controller.</p>
<p>Run 'scontrol takeover' before making any changes, then make your
changes and restart slurmctld on the primary.</p>
<p>If the restart fails, no harm, no foul, because the secondary is
still running happily. If it succeeds, the primary takes control back
and you can then restart the secondary with the new (known good)
config.</p>
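<p>As a rough sketch of the sequence above, the Python below walks the
steps in order and stops at the first command that fails. It assumes the
script runs on the primary controller host and that slurmctld is managed
by a systemd unit named "slurmctld"; the STEPS list and the run_steps
helper are made up for illustration. 'scontrol ping' is included only so
you can read which controllers report as up after the restart.</p>
<pre>
```python
import shlex
import subprocess

# Each step is (description, argv). argv=None marks a manual step.
# Assumptions: the script runs on the primary controller host, and slurmctld
# is managed by a systemd unit named "slurmctld".
STEPS = [
    ("backup controller takes over", ["scontrol", "takeover"]),
    ("edit slurm.conf", None),
    ("restart slurmctld on the primary", ["systemctl", "restart", "slurmctld"]),
    ("check which controllers are up", ["scontrol", "ping"]),
    ("restart slurmctld on the backup with the new config", None),
]


def run_steps(steps=STEPS, dry_run=True, confirm=input):
    """Walk the steps in order and return the descriptions that completed.

    With dry_run=True nothing is executed or prompted; steps are only printed.
    Otherwise commands are run, manual steps wait for confirm(), and the walk
    stops at the first command that exits with a non-zero status.
    """
    done = []
    for description, argv in steps:
        label = shlex.join(argv) if argv else "(manual)"
        print(f"{description}: {label}")
        if not dry_run:
            if argv is None:
                confirm(f"Press Enter once done: {description} ")
            elif subprocess.run(argv).returncode != 0:
                print(f"stopped at: {label}")
                break
        done.append(description)
    return done
```
</pre>
<p>Calling run_steps() with the default dry_run=True only prints the
plan, which is a safe way to review it before running anything for
real.</p>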
<p><br>
</p>
<p>Brian Andrus<br>
</p>
<p><br>
</p>
<div class="moz-cite-prefix">On 1/17/2023 12:36 PM, Groner, Rob
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:BL0PR02MB44994063CBFAA2D771500B1E80C69@BL0PR02MB4499.namprd02.prod.outlook.com">
<style type="text/css" style="display:none;">P {margin-top:0;margin-bottom:0;}</style>
<div class="elementToProof"><span style="font-family: Calibri,
Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0,
0, 0); background-color: rgb(255, 255, 255);"
class="elementToProof">So, you have two equal-sized clusters,
one for test and one for production? Our test cluster is a
small handful of machines compared to our production.</span></div>
<div class="elementToProof"><span style="font-family: Calibri,
Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0,
0, 0); background-color: rgb(255, 255, 255);"
class="elementToProof"><br>
</span></div>
<div class="elementToProof"><span style="font-family: Calibri,
Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0,
0, 0); background-color: rgb(255, 255, 255);"
class="elementToProof">We have a test slurm control node on a
test cluster with a test slurmdbd host and test nodes, all
named specifically for test. We don't want a situation where
our "test" slurm controller node is named the same as our
"prod" slurm controller node, because the possibility of
mistake is too great. ("I THOUGHT I was on the test
network....")</span></div>
<div class="elementToProof"><span style="font-family: Calibri,
Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0,
0, 0); background-color: rgb(255, 255, 255);"
class="elementToProof"><br>
</span></div>
<div class="elementToProof"><span style="font-family: Calibri,
Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0,
0, 0); background-color: rgb(255, 255, 255);"
class="elementToProof">Here's the ultimate question I'm trying
to get answered.... Does anyone update their slurm.conf file
on production outside of an outage? If so, how do you KNOW
the slurmctld won't barf on some problem in the file you
didn't see (even a mistaken character in there would do it)?
We're trying to move to a model where we don't have downtimes
as often, so I need to determine a reliable way to continue to
add features to slurm without having to wait for the next
outage. There's no way I know of to prove the slurm.conf file
is good, except by feeding it to slurmctld and crossing my
fingers.</span></div>
<div class="elementToProof"><span style="font-family: Calibri,
Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0,
0, 0); background-color: rgb(255, 255, 255);"
class="elementToProof"><br>
</span></div>
<div class="elementToProof"><span style="font-family: Calibri,
Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0,
0, 0); background-color: rgb(255, 255, 255);"
class="elementToProof">Rob</span></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif;
font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<hr tabindex="-1" style="display:inline-block; width:98%">
<div id="divRplyFwdMsg" dir="ltr"><font style="font-size: 11pt;"
face="Calibri, sans-serif" color="#000000"><b>From:</b>
slurm-users <a class="moz-txt-link-rfc2396E" href="mailto:slurm-users-bounces@lists.schedmd.com">&lt;slurm-users-bounces@lists.schedmd.com&gt;</a> on
behalf of Fulcomer, Samuel <a class="moz-txt-link-rfc2396E" href="mailto:samuel_fulcomer@brown.edu">&lt;samuel_fulcomer@brown.edu&gt;</a><br>
<b>Sent:</b> Wednesday, January 4, 2023 1:54 PM<br>
<b>To:</b> Slurm User Community List
<a class="moz-txt-link-rfc2396E" href="mailto:slurm-users@lists.schedmd.com">&lt;slurm-users@lists.schedmd.com&gt;</a><br>
<b>Subject:</b> Re: [slurm-users] Maintaining slurm config
files for test and production clusters</font>
<div> </div>
</div>
<div>
<div>
<div dir="ltr">Just make the cluster names the same, with
different NodeName and PartitionName lines. The rest of
slurm.conf can be the same. Having two cluster names is only
necessary if you're running production in a multi-cluster
configuration.
<div><br>
</div>
<div>Our model has been to have a production cluster and a
test cluster which becomes the production cluster at
yearly upgrade time (for us, next week). The test cluster
is also used for rebuilding MPI prior to the upgrade, when
the PMI changes. We force users to resubmit jobs at
upgrade time (after the maintenance reservation) to ensure
that MPI runs correctly.</div>
<div><br>
</div>
<div><br>
</div>
</div>
<br>
<div class="x_gmail_quote">
<div dir="ltr" class="x_gmail_attr">On Wed, Jan 4, 2023 at
12:26 PM Groner, Rob &lt;<a href="mailto:rug262@psu.edu"
data-auth="NotApplicable" data-loopstyle="link"
moz-do-not-send="true" class="moz-txt-link-freetext">rug262@psu.edu</a>&gt;
wrote:<br>
</div>
<blockquote class="x_gmail_quote" style="margin:0px 0px 0px
0.8ex; border-left:1px solid rgb(204,204,204);
padding-left:1ex">
<div class="x_msg-7556422008998512349">
<div dir="ltr">
<div><span style="font-family: Calibri, Arial,
Helvetica, sans-serif; font-size: 12pt; color:
rgb(0, 0, 0); background-color: rgb(255, 255,
255);">We currently have a test cluster and a
production cluster, all on the same network. We
try things on the test cluster, and then we gather
those changes and make a change to the production
cluster. We're doing that through two different
repos, but we'd like to have a single repo to make
the transition from testing configs to publishing
them more seamless. The problem is, of course,
that the test cluster and production clusters have
different cluster names, as well as different
nodes within them.</span></div>
<div><span style="font-family: Calibri, Arial,
Helvetica, sans-serif; font-size: 12pt; color:
rgb(0, 0, 0); background-color: rgb(255, 255,
255);"><br>
</span></div>
<div><span style="font-family: Calibri, Arial,
Helvetica, sans-serif; font-size: 12pt; color:
rgb(0, 0, 0); background-color: rgb(255, 255,
255);">Using the include directive, I can pull all
of the NodeName lines out of slurm.conf and put
them into %c-nodes.conf files, one for production,
one for test. That still leaves me with two
problems:</span></div>
<div>
<ul>
<li style="font-size: 12pt; font-family: Calibri,
Arial, Helvetica, sans-serif; color: rgb(0, 0,
0); background-color: rgb(255, 255, 255);">
<span style="font-family: Calibri, Arial,
Helvetica, sans-serif; font-size: 12pt; color:
rgb(0, 0, 0); background-color: rgb(255, 255,
255);">The clustername itself will still be a
problem. I WANT the same slurm.conf file
between test and production...but the
clustername line will be different for them
both. Can I use an env var in that cluster
name, because on production there could be a
different env var value than on test?</span></li>
<li style="font-size: 12pt; font-family: Calibri,
Arial, Helvetica, sans-serif; color: rgb(0, 0,
0); background-color: rgb(255, 255, 255);">
<span style="font-family: Calibri, Arial,
Helvetica, sans-serif; font-size: 12pt; color:
rgb(0, 0, 0); background-color: rgb(255, 255,
255);">The gres.conf file. I tried using the
same "include" trick that works on slurm.conf,
but it failed because it did not know what the
"ClusterName" was. I think that means that
either it doesn't work for anything other than
slurm.conf, or that the clustername will have
to be defined in gres.conf as well?</span></li>
</ul>
<div><span style="font-family: Calibri, Arial,
Helvetica, sans-serif; font-size: 12pt; color:
rgb(0, 0, 0); background-color: rgb(255, 255,
255);">Any other suggestions of how to keep our
slurm files in a single source control repo, but
still have the flexibility to have them run
elegantly on either test or production systems?</span></div>
<div><span style="font-family: Calibri, Arial,
Helvetica, sans-serif; font-size: 12pt; color:
rgb(0, 0, 0); background-color: rgb(255, 255,
255);"><br>
</span></div>
<div><span style="font-family: Calibri, Arial,
Helvetica, sans-serif; font-size: 12pt; color:
rgb(0, 0, 0); background-color: rgb(255, 255,
255);">Thanks.</span></div>
<div><span style="font-family: Calibri, Arial,
Helvetica, sans-serif; font-size: 12pt; color:
rgb(0, 0, 0); background-color: rgb(255, 255,
255);"><br>
</span></div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
</body>
</html>