[slurm-users] Questions about adding new nodes to Slurm

Sid Young sid.young at gmail.com
Tue May 4 12:32:19 UTC 2021


You can push a new conf file and issue an "scontrol reconfigure" on the fly
as needed... I do it on our cluster whenever required: nodes first, then the
login nodes, then the slurm controller... you are making a huge issue of a
very basic task...
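
As a rough sketch, that workflow might look like the following (the node
names, hardware values, and file-distribution step are illustrative
assumptions; only "scontrol reconfigure" itself is the actual command):

```
# Append the new nodes to slurm.conf (same edit in every copy of the file).
# Example lines -- hostnames and hardware figures are hypothetical:
#   NodeName=node[101-110] CPUs=64 RealMemory=256000 State=UNKNOWN
#   PartitionName=batch Nodes=node[001-110] Default=YES State=UP
#
# Distribute the updated file to all hosts (pdcp, rsync, config
# management -- whatever you already use), then tell the running
# daemons to re-read their configuration:
scontrol reconfigure
```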

Sid


On Tue, 4 May 2021, 22:28 Tina Friedrich, <tina.friedrich at it.ox.ac.uk>
wrote:

> Hello,
>
> a lot of people have already given very good answers on how to tackle this.
>
> Still, I thought it worth pointing this out - you said 'you need to
> basically shut down slurm, update the slurm.conf file, then restart'.
> That makes it sound like a major operation with lots of prep required.
>
> It's not like that at all. Updating slurm.conf is not a major operation.
>
> There's absolutely no reason to shut things down first & then change the
> file. You can edit the file / ship out a new version (however you like)
> and then restart the daemons.
>
> The daemons do not have to all be restarted simultaneously. It is of no
> consequence if they're running with out-of-sync config files for a bit,
> really. (There's a debug flag you can set if you want to suppress the
> warning - 'NO_CONF_HASH', I think.)
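>
> As a minimal sketch, that would go in slurm.conf like this (the flag
> name is as I recall it - worth verifying in the slurm.conf man page):
>
> ```
> DebugFlags=NO_CONF_HASH
> ```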
>
> Restarting the daemons (slurmctld, slurmd, ...) is safe. It does not
> require cluster downtime or anything.
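>
> On systemd-based installs, a restart is typically just the following
> (unit names assumed to be the packaged defaults - adjust for your
> distribution):
>
> ```
> systemctl restart slurmctld   # on the controller host
> systemctl restart slurmd      # on each compute node
> ```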
>
> I control slurm.conf using configuration management; the config
> management process restarts the appropriate daemon (slurmctld, slurmd,
> slurmdbd) if the file changed. These restarts certainly never happen at
> the same time; there's splay built into the process. It doesn't even
> necessarily happen on the controller first, or anything like that.
>
> What I'm trying to get across is that I have a feeling this 'updating
> the cluster-wide config file' and 'file must be the same on all nodes'
> business is a lot less of a procedure (and a lot less strict) than you
> currently imagine it to be :)
>
> Tina
>
> On 27/04/2021 19:35, David Henkemeyer wrote:
> > Hello,
> >
> > I'm new to Slurm (coming from PBS), and so I will likely have a few
> > questions over the next several weeks, as I work to transition my
> > infrastructure from PBS to Slurm.
> >
> > My first question has to do with *_adding nodes to Slurm_*.  According
> > to the FAQ (and other articles I've read), you need to basically shut
> > down slurm, update the slurm.conf file /*on all nodes in the cluster*/,
> > then restart slurm.
> >
> > - Why do all nodes need to know about all other nodes?  From what I have
> > read, it's because Slurm does a checksum comparison of the slurm.conf file
> > across all nodes.  Is this the only reason all nodes need to know about
> > all other nodes?
> > - Can I create a symlink that points <sysconfdir>/slurm.conf to a
> > slurm.conf file on an NFS mount point, which is mounted on all the
> > nodes?  This way, I would only need to update a single file, then
> > restart Slurm across the entire cluster.
> > - Any additional help/resources for adding/removing nodes to Slurm would
> > be much appreciated.  Perhaps there is a "toolkit" out there to automate
> > some of these operations (which is what I already have for PBS, and will
> > create for Slurm, if something doesn't already exist).
> >
> > Thank you all,
> >
> > David
>
> --
> Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator
>
> Research Computing and Support Services
> IT Services, University of Oxford
> http://www.arc.ox.ac.uk http://www.it.ox.ac.uk
>
>


More information about the slurm-users mailing list