<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>1. Part of Slurm's communication is hierarchical, so nodes
need to know about other nodes in order to talk to each
other and forward messages to the slurmctld.</p>
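<p>To illustrate why every node needs the full node list, here is a
rough Python sketch of a fan-out tree. It is a simplified model, not
Slurm's actual code: the function name <code>forwarding_plan</code>,
the <code>fanout</code> parameter (standing in for what slurm.conf
calls TreeWidth) and the node names are all made up for the
example.</p>

```python
def forwarding_plan(nodes, fanout):
    """Return (sender, receiver) hops for a simplified fan-out tree.

    Hypothetical model: the sender splits its targets into at most
    `fanout` chunks, contacts the first node of each chunk, and hands
    that node the rest of the chunk to forward in the same way.
    """
    hops = []

    def send(sender, targets):
        if not targets:
            return
        size = -(-len(targets) // fanout)  # ceiling division
        for i in range(0, len(targets), size):
            chunk = targets[i:i + size]
            head, rest = chunk[0], chunk[1:]
            hops.append((sender, head))
            # `head` must know about the nodes in `rest` to reach them.
            send(head, rest)

    send("slurmctld", list(nodes))
    return hops


if __name__ == "__main__":
    for sender, receiver in forwarding_plan(["n1", "n2", "n3", "n4", "n5", "n6"], 2):
        print(sender, "forwards to", receiver)
```

<p>In this model the slurmctld only contacts two nodes directly, and
those nodes relay to the others, which only works if they can look up
their peers in their own copy of the config.</p>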
<p>2. Yes, this is what we do. Our slurm.conf is shared via
NFS from our Slurm master, so we only update that single
file. After the update we use salt to issue a global
restart of all the slurmd's and the slurmctld so they pick up the new
config. scontrol reconfigure is not enough when adding new nodes;
you have to issue a global restart.</p>
<p>3. It's pretty straightforward, all told. You just need to
update the slurm.conf and do a restart. Be careful
that the names you enter into the slurm.conf are resolvable by
DNS, or else slurmctld may barf on restart. Sadly, no built-in sanity
checker exists that I am aware of, aside from actually running
slurmctld. We got around this by putting together a GitLab runner
that screens our slurm.conf files by running a synthetic slurmctld as
a sanity check.</p>
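<p>A cheaper pre-check for the DNS part alone could look something
like the Python sketch below. This is not our GitLab runner, just a
minimal illustration under some assumptions: the helper names
<code>node_names</code> and <code>unresolvable</code> are hypothetical,
only plain names and a single <code>prefix[lo-hi]</code> range per
NodeName line are expanded, and NodeAddr/NodeHostname overrides are
ignored.</p>

```python
import re
import socket


def node_names(conf_text):
    """Yield host names from NodeName= lines (simplified parsing).

    Assumption: each NodeName value is either a plain name or one
    prefix[lo-hi]suffix range; comma-separated lists are not handled.
    """
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        match = re.match(r"NodeName=(\S+)", line, re.IGNORECASE)
        if not match or match.group(1).upper() == "DEFAULT":
            continue
        spec = match.group(1)
        rng = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
        if rng is None:
            yield spec
            continue
        prefix, lo, hi, suffix = rng.groups()
        for n in range(int(lo), int(hi) + 1):
            # Keep the zero padding used in the config, e.g. node01.
            yield f"{prefix}{n:0{len(lo)}d}{suffix}"


def unresolvable(names, resolve=socket.getaddrinfo):
    """Return the names that the resolver fails to look up."""
    bad = []
    for name in names:
        try:
            resolve(name, None)
        except OSError:  # socket.gaierror is a subclass of OSError
            bad.append(name)
    return bad


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as handle:
        missing = unresolvable(node_names(handle.read()))
    for name in missing:
        print("does not resolve:", name)
    sys.exit(1 if missing else 0)
```

<p>Run it with the path to the slurm.conf as the only argument; a
non-zero exit status means at least one name did not resolve. It does
not replace actually starting slurmctld against the new config, since
it checks nothing except name resolution.</p>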
<p>-Paul Edmon-<br>
</p>
<div class="moz-cite-prefix">On 4/27/2021 2:35 PM, David Henkemeyer
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CABjsmAHC1YKwbZ34_9xj3g_RPKkbhTMu0eoa7Ke-pAA20KgkWQ@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">
<div class="gmail_quote">
<div dir="ltr">Hello,
<div><br>
</div>
<div>I'm new to Slurm (coming from PBS), and so I will
likely have a few questions over the next several weeks,
as I work to transition my infrastructure from PBS to
Slurm.</div>
<div><br>
</div>
<div>My first question has to do with <b><u>adding nodes to
Slurm</u></b>. According to the FAQ (and other
articles I've read), you need to basically shut down
slurm, update the slurm.conf file <i><b>on all nodes in
the cluster</b></i>, then restart slurm.</div>
<div><br>
</div>
<div>- Why do all nodes need to know about all other nodes?
From what I have read, Slurm does a checksum
comparison of the slurm.conf file across all nodes. Is
this the only reason all nodes need to know about all
other nodes? </div>
<div>- Can I create a symlink that
points &lt;sysconfdir&gt;/slurm.conf to a slurm.conf file
on an NFS mount point, which is mounted on all the nodes?
This way, I would only need to update a single file, then
restart Slurm across the entire cluster.</div>
<div>- Any additional help/resources for adding/removing
nodes to Slurm would be much appreciated. Perhaps there
is a "toolkit" out there to automate some of these
operations (which is what I already have for PBS, and will
create for Slurm, if something doesn't already exist).</div>
<div><br>
</div>
<div>Thank you all,</div>
<div><br>
</div>
<div>David</div>
</div>
</div>
</div>
</blockquote>
</body>
</html>