<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>I agree that people are making updating slurm.conf a bigger issue
than it really is. However, there are certain
config changes that do require restarting the daemons rather than
just running 'scontrol reconfigure'. These options are documented in
the slurm.conf documentation (just search for "restart").<br>
</p>
<p>I believe it's often only slurmctld that needs to be
restarted, which is one daemon on one system, rather than
slurmd on all the compute nodes, but there are a few
changes that require all Slurm daemons to be restarted. Adding nodes to a
cluster is one of them: <br>
</p>
<p>
<blockquote type="cite">Changes in node configuration (e.g. adding
nodes, changing their
processor count, etc.) require restarting both the slurmctld
daemon
and the slurmd daemons.
All slurmd daemons must know each node in the system to forward
messages in support of hierarchical communications</blockquote>
</p>
<p>To avoid this, you can define nodes ahead of time with
State=FUTURE, so they are already in slurm.conf when they arrive: <br>
</p>
<p>
<blockquote type="cite">
<dl compact="compact">
<dt><b>FUTURE</b></dt>
<dd>
Indicates the node is defined for future use and need not
exist when the Slurm daemons are started. These nodes can be
made available
for use simply by updating the node state using the scontrol
command rather
than restarting the slurmctld daemon. After these nodes are
made available,
change their State in the slurm.conf file. Until these nodes
are made
available, they will not be seen using any Slurm commands or
nor will
any attempt be made to contact them.
</dd>
</dl>
</blockquote>
</p>
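<p>As a rough sketch of what that could look like in slurm.conf (the
node names, CPU counts and memory sizes below are made up for
illustration; adjust them to your hardware):<br>
</p>
<pre>
# Nodes that exist today (hypothetical names and sizes)
NodeName=node[01-16] CPUs=32 RealMemory=128000 State=UNKNOWN
# Placeholders for nodes that will be added later
NodeName=node[17-32] CPUs=32 RealMemory=128000 State=FUTURE
</pre>
<p>When the new hardware is ready, the node state is changed with an
'scontrol update NodeName=... State=...' command, as described in the
documentation quoted above. Check the scontrol man page for the state
to use with your Slurm version.<br>
</p>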
<p>--<br>
Prentice<br>
</p>
<div class="moz-cite-prefix">On 5/4/21 8:32 AM, Sid Young wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAEZ+gOxsa=6gU2MnG8C9Mj8Metq-7drDoa4xSF3cmZ714vfD3A@mail.gmail.com">
<div dir="auto">You can push a new conf file and issue an
"scontrol reconfigure" on the fly as needed... I do it on our
cluster as needed, do the nodes first then login nodes then the
slurm controller... you are making a huge issue of a very basic
task...
<div dir="auto"><br>
</div>
<div dir="auto">Sid</div>
<div dir="auto"><br>
</div>
</div>
</blockquote>
<blockquote type="cite"
cite="mid:CAEZ+gOxsa=6gU2MnG8C9Mj8Metq-7drDoa4xSF3cmZ714vfD3A@mail.gmail.com"><br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, 4 May 2021, 22:28 Tina
Friedrich, &lt;<a href="mailto:tina.friedrich@it.ox.ac.uk"
target="_blank" rel="noreferrer" moz-do-not-send="true">tina.friedrich@it.ox.ac.uk</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Hello,<br>
<br>
a lot of people already gave very good answers on how to tackle
this.<br>
<br>
Still, I thought it worth pointing this out - you said 'you
need to <br>
basically shut down slurm, update the slurm.conf file, then
restart'. <br>
That makes it sound like a major operation with lots of prep
required.<br>
<br>
It's not like that at all. Updating slurm.conf is not a major
operation.<br>
<br>
There's absolutely no reason to shut things down first &amp;
then change the <br>
file. You can edit the file / ship out a new version (however
you like) <br>
and then restart the daemons.<br>
<br>
The daemons do not have to all be restarted simultaneously. It
is of no <br>
consequence if they're running with out-of-sync config files
for a bit, <br>
really. (There's a flag you can set if you want to suppress
the warning <br>
- 'NO_CONF_HASH' debug flag I think).<br>
<br>
Restarting the daemons (slurmctld, slurmd, ...) is safe. It
does not <br>
require cluster downtime or anything.<br>
<br>
I control slurm.conf using configuration management; the
config <br>
management process restarts the appropriate daemon (slurmctld,
slurmd, <br>
slurmdbd) if the file changed. This certainly never happens at
the same <br>
time; there's splay in that. It doesn't even necessarily
happen on the <br>
controller first, or anything like that.<br>
<br>
What I'm trying to get across - I have a feeling this
'updating the <br>
cluster wide config file' and 'file must be the same on all
nodes' is a <br>
lot less of a procedure (and a lot less strict) than you
currently <br>
imagine it to be :)<br>
<br>
Tina<br>
<br>
On 27/04/2021 19:35, David Henkemeyer wrote:<br>
> Hello,<br>
> <br>
> I'm new to Slurm (coming from PBS), and so I will likely
have a few <br>
> questions over the next several weeks, as I work to
transition my <br>
> infrastructure from PBS to Slurm.<br>
> <br>
> My first question has to do with *_adding nodes to
Slurm_*. According <br>
> to the FAQ (and other articles I've read), you need to
basically shut <br>
> down slurm, update the slurm.conf file /*on all nodes in
the cluster*/, <br>
> then restart slurm.<br>
> <br>
> - Why do all nodes need to know about all other nodes?
From what I have <br>
> read, Slurm does a checksum comparison of the
slurm.conf file across <br>
> all nodes. Is this the only reason all nodes need to
know about all <br>
> other nodes?<br>
> - Can I create a symlink that
points &lt;sysconfdir&gt;/slurm.conf to a <br>
> slurm.conf file on an NFS mount point, which is mounted
on all the <br>
> nodes? This way, I would only need to update a single
file, then <br>
> restart Slurm across the entire cluster.<br>
> - Any additional help/resources for adding/removing nodes
to Slurm would <br>
> be much appreciated. Perhaps there is a "toolkit" out
there to automate <br>
> some of these operations (which is what I already have
for PBS, and will <br>
> create for Slurm, if something doesn't already exist).<br>
> <br>
> Thank you all,<br>
> <br>
> David<br>
<br>
-- <br>
Tina Friedrich, Advanced Research Computing Snr HPC Systems
Administrator<br>
<br>
Research Computing and Support Services<br>
IT Services, University of Oxford<br>
<a href="http://www.arc.ox.ac.uk" rel="noreferrer" target="_blank" moz-do-not-send="true">http://www.arc.ox.ac.uk</a>
<a href="http://www.it.ox.ac.uk" rel="noreferrer" target="_blank" moz-do-not-send="true">http://www.it.ox.ac.uk</a><br>
<br>
</blockquote>
</div>
</blockquote>
</body>
</html>