[slurm-users] Questions about adding new nodes to Slurm

Tina Friedrich tina.friedrich at it.ox.ac.uk
Tue May 4 12:47:56 UTC 2021


Not sure if that's changed but aren't there cases where 'scontrol 
reconfigure' isn't sufficient? (Like adding nodes?)
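
To make that question concrete, here is a minimal Python sketch that compares the NodeName lines of two slurm.conf versions. It suggests a daemon restart when they differ and 'scontrol reconfigure' otherwise. The function names are hypothetical. The rule it encodes (node changes need restarts) is an assumption that depends on the Slurm release, so check the slurm.conf man page for your version. Backslash line continuations are not handled.

```python
def node_definitions(conf_text):
    """Return the set of NodeName lines in a slurm.conf text.

    Comments are stripped and runs of whitespace are collapsed, so purely
    cosmetic edits do not count as a change.
    """
    found = set()
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line.lower().startswith("nodename="):
            found.add(" ".join(line.split()))
    return found


def suggested_action(old_text, new_text):
    """Suggest 'restart' if node definitions changed, else 'reconfigure'.

    Assumption: on the Slurm version in use, adding or removing nodes
    needs slurmctld/slurmd restarts rather than 'scontrol reconfigure'.
    """
    if node_definitions(old_text) != node_definitions(new_text):
        return "restart"
    return "reconfigure"
```

Something like suggested_action(old_conf, new_conf) could then feed whatever mechanism pushes the file out.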

But yes, that's my point exactly: updating slurm.conf is a pretty basic 
day-to-day task, not some daunting operation that requires a downtime or 
anything like it. (When I started with Slurm, coming from Grid Engine, 
this requirement to update the config file everywhere & restart 
everything sounded to me like a major task that needs announcements & 
downtimes. It took me a while to figure out, and trust, that an update 
to slurm.conf is a very minor task, and not really a risky one :) )
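
The config-management behaviour described in the quoted message below (restart the local daemon only if slurm.conf changed, with some splay) can be sketched roughly as follows. This is a minimal Python sketch, not what any particular config management tool does: the function names, the state file that remembers the last seen hash, and the use of 'systemctl restart' with a unit such as slurmd or slurmctld are all assumptions for illustration.

```python
import hashlib
import random
import subprocess
import time
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def restart_if_changed(conf_path, state_path, unit, max_splay=0.0,
                       run=subprocess.run, sleep=time.sleep):
    """Restart `unit` if the config file differs from the last seen version.

    state_path is a hypothetical file holding the digest recorded on the
    previous run. max_splay is the upper bound, in seconds, of a random
    delay so that nodes do not all restart at the same moment.
    Returns True if a restart was issued, False otherwise.
    """
    state = Path(state_path)
    current = file_digest(conf_path)
    previous = state.read_text().strip() if state.exists() else None
    if current == previous:
        return False
    sleep(random.uniform(0, max_splay))
    # Assumption: the daemon is managed by systemd under this unit name.
    run(["systemctl", "restart", unit], check=True)
    state.write_text(current + "\n")
    return True
```

On a compute node this might be called as restart_if_changed("/etc/slurm/slurm.conf", "/var/tmp/slurm.conf.sha256", "slurmd", max_splay=300), with both paths being placeholders.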

Tina

On 04/05/2021 13:32, Sid Young wrote:
> You can push a new conf file and issue an "scontrol reconfigure" on the 
> fly as needed... I do it on our cluster as needed, do the nodes first 
> then login nodes then the slurm controller... you are making a huge 
> issue of a very basic task...
> 
> Sid
> 
> 
> On Tue, 4 May 2021, 22:28 Tina Friedrich <tina.friedrich at it.ox.ac.uk> wrote:
> 
>     Hello,
> 
>     a lot of people already gave very good answers on how to tackle this.
> 
>     Still, I thought it worth pointing this out - you said 'you need to
>     basically shut down slurm, update the slurm.conf file, then restart'.
>     That makes it sound like a major operation with lots of prep required.
> 
>     It's not like that at all. Updating slurm.conf is not a major operation.
> 
>     There's absolutely no reason to shut things down first & then change the
>     file. You can edit the file / ship out a new version (however you like)
>     and then restart the daemons.
> 
>     The daemons do not have to all be restarted simultaneously. It is of no
>     consequence if they're running with out-of-sync config files for a bit,
>     really. (If you want to suppress the warning about mismatched files,
>     there's a flag you can set: the 'NO_CONF_HASH' debug flag, I think.)
> 
>     Restarting the daemons (slurmctld, slurmd, ...) is safe. It does not
>     require cluster downtime or anything.
> 
>     I control slurm.conf using configuration management; the config
>     management process restarts the appropriate daemon (slurmctld, slurmd,
>     slurmdbd) if the file changed. This certainly never happens at the same
>     time; there's splay in that. It doesn't even necessarily happen on the
>     controller first, or anything like that.
> 
>     What I'm trying to get across - I have a feeling this 'updating the
>     cluster wide config file' and 'file must be the same on all nodes' is a
>     lot less of a procedure (and a lot less strict) than you currently
>     imagine it to be :)
> 
>     Tina
> 
>     On 27/04/2021 19:35, David Henkemeyer wrote:
>      > Hello,
>      >
>      > I'm new to Slurm (coming from PBS), and so I will likely have a few
>      > questions over the next several weeks, as I work to transition my
>      > infrastructure from PBS to Slurm.
>      >
>      > My first question has to do with *_adding nodes to Slurm_*. According
>      > to the FAQ (and other articles I've read), you need to basically shut
>      > down slurm, update the slurm.conf file /*on all nodes in the cluster*/,
>      > then restart slurm.
>      >
>      > - Why do all nodes need to know about all other nodes?  From what I
>      > have read, Slurm does a checksum comparison of the slurm.conf file
>      > across all nodes.  Is this the only reason all nodes need to know
>      > about all other nodes?
>      > - Can I create a symlink that points <sysconfdir>/slurm.conf to a
>      > slurm.conf file on an NFS mount point, which is mounted on all the
>      > nodes?  This way, I would only need to update a single file, then
>      > restart Slurm across the entire cluster.
>      > - Any additional help/resources for adding/removing nodes to Slurm
>      > would be much appreciated.  Perhaps there is a "toolkit" out there to
>      > automate some of these operations (which is what I already have for
>      > PBS, and will create for Slurm, if something doesn't already exist).
>      >
>      > Thank you all,
>      >
>      > David
> 
>     -- 
>     Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator
> 
>     Research Computing and Support Services
>     IT Services, University of Oxford
>     http://www.arc.ox.ac.uk http://www.it.ox.ac.uk
> 

-- 
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk


