[slurm-users] [External] Re: Questions about adding new nodes to Slurm

Prentice Bisbal pbisbal at pppl.gov
Tue May 4 19:14:25 UTC 2021


I agree that people are making updating slurm.conf out to be a bigger 
issue than it really is. However, there are certain config changes that 
do require restarting the daemons rather than just doing 'scontrol 
reconfigure'. These options are documented in the slurm.conf man page 
(just search for "restart").
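
For most changes, the difference looks roughly like this on a 
systemd-packaged install (the unit names below are an assumption about 
your particular setup):

    # picked up by all running daemons, no restart needed:
    scontrol reconfigure

    # only for the options the man page flags as needing a restart:
    systemctl restart slurmctld    # on the controller
    systemctl restart slurmd       # on each compute node, when required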

I believe it's often only the slurmctld that needs to be restarted, 
which is one daemon on one system, rather than restarting slurmd on all 
the compute nodes, but there are a few changes that require all Slurm 
daemons to be restarted. Adding nodes to a cluster is one of them:

> Changes in node configuration (e.g. adding nodes, changing their 
> processor count, etc.) require restarting both the slurmctld daemon 
> and the slurmd daemons. All slurmd daemons must know each node in the 
> system to forward messages in support of hierarchical communications

But to avoid this, you can use the FUTURE state to define "future" nodes:

> *FUTURE*
>     Indicates the node is defined for future use and need not exist
>     when the Slurm daemons are started. These nodes can be made
>     available for use simply by updating the node state using the
>     scontrol command rather than restarting the slurmctld daemon.
>     After these nodes are made available, change their State in the
>     slurm.conf file. Until these nodes are made available, they will
>     not be seen using any Slurm commands, nor will any attempt be
>     made to contact them. 
>
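
As a rough sketch of that workflow (hostnames and the CPU/memory figures 
are made up for illustration; check the slurm.conf and scontrol man pages 
for the exact state transition on your Slurm version):

    # slurm.conf (identical everywhere): pre-define nodes that don't exist yet
    NodeName=node[09-16] CPUs=32 RealMemory=128000 State=FUTURE
    PartitionName=batch Nodes=node[01-16] Default=YES State=UP

    # later, once node09 is racked and running slurmd, bring it into
    # service without restarting slurmctld:
    scontrol update NodeName=node09 State=RESUME

Then, as the documentation above says, remember to change the node's 
State in slurm.conf itself.
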
--
Prentice

On 5/4/21 8:32 AM, Sid Young wrote:
> You can push a new conf file and issue an "scontrol reconfigure" on 
> the fly as needed... I do it on our cluster as needed, do the nodes 
> first then login nodes then the slurm controller... you are making a 
> huge issue of a very basic task...
>
> Sid
>
>
> On Tue, 4 May 2021, 22:28 Tina Friedrich, <tina.friedrich at it.ox.ac.uk> wrote:
>
>     Hello,
>
>     a lot of people already gave very good answer to how to tackle this.
>
>     Still, I thought it worth pointing this out - you said 'you need to
>     basically shut down slurm, update the slurm.conf file, then restart'.
>     That makes it sound like a major operation with lots of prep required.
>
>     It's not like that at all. Updating slurm.conf is not a major
>     operation.
>
>     There's absolutely no reason to shut things down first & then
>     change the file. You can edit the file / ship out a new version
>     (however you like) and then restart the daemons.
>
>     The daemons do not have to all be restarted simultaneously. It is
>     of no consequence if they're running with out-of-sync config files
>     for a bit, really. (There's a flag you can set if you want to
>     suppress the warning - the 'NO_CONF_HASH' debug flag, I think).
>
>     Restarting the daemons (slurmctld, slurmd, ...) is safe. It does not
>     require cluster downtime or anything.
>
>     I control slurm.conf using configuration management; the config
>     management process restarts the appropriate daemon (slurmctld,
>     slurmd, slurmdbd) if the file changed. This certainly never happens
>     at the same time; there's splay in that. It doesn't even necessarily
>     happen on the controller first, or anything like that.
>
>     What I'm trying to get across - I have a feeling this 'updating the
>     cluster wide config file' and 'file must be the same on all nodes'
>     is a lot less of a procedure (and a lot less strict) than you
>     currently imagine it to be :)
>
>     Tina
>
>     On 27/04/2021 19:35, David Henkemeyer wrote:
>     > Hello,
>     >
>     > I'm new to Slurm (coming from PBS), and so I will likely have a few
>     > questions over the next several weeks, as I work to transition my
>     > infrastructure from PBS to Slurm.
>     >
>     > My first question has to do with *_adding nodes to Slurm_*.
>     > According to the FAQ (and other articles I've read), you need to
>     > basically shut down slurm, update the slurm.conf file /*on all
>     > nodes in the cluster*/, then restart slurm.
>     >
>     > - Why do all nodes need to know about all other nodes? From what
>     > I have read, it's because Slurm does a checksum comparison of the
>     > slurm.conf file across all nodes.  Is this the only reason all
>     > nodes need to know about all other nodes?
>     > - Can I create a symlink that points <sysconfdir>/slurm.conf to a
>     > slurm.conf file on an NFS mount point, which is mounted on all the
>     > nodes?  This way, I would only need to update a single file, then
>     > restart Slurm across the entire cluster.
>     > - Any additional help/resources for adding/removing nodes to
>     > Slurm would be much appreciated.  Perhaps there is a "toolkit" out
>     > there to automate some of these operations (which is what I
>     > already have for PBS, and will create for Slurm, if something
>     > doesn't already exist).
>     >
>     > Thank you all,
>     >
>     > David
>
>     -- 
>     Tina Friedrich, Advanced Research Computing Snr HPC Systems
>     Administrator
>
>     Research Computing and Support Services
>     IT Services, University of Oxford
>     http://www.arc.ox.ac.uk
>     http://www.it.ox.ac.uk
>