[slurm-users] Questions about adding new nodes to Slurm

Tue May 4 13:51:37 UTC 2021

The task of adding or removing nodes from Slurm is well documented and 
discussed in SchedMD presentations. Please see my Wiki page:
https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes

/Ole


On 04-05-2021 14:47, Tina Friedrich wrote:
> Not sure if that's changed but aren't there cases where 'scontrol 
> reconfigure' isn't sufficient? (Like adding nodes?)
> 
> But yes, that's my point exactly; it is a pretty basic day to day task 
> to update slurm.conf, not some daunting operation that requires a 
> downtime or anything like it. (I remember this requirement to update the 
> config file everywhere & restart everything sounding like a major task 
> that requires announcements & downtimes to me when I started with SLURM 
> - coming from Grid Engine - and it took me a while to figure out, and 
> trust, that an update to slurm.conf is a very minor task, and not a 
> risky one really :) ))
> 
> Tina
> 
> On 04/05/2021 13:32, Sid Young wrote:
>> You can push a new conf file and issue an "scontrol reconfigure" on 
>> the fly as needed... I do it on our cluster as needed, do the nodes 
>> first then login nodes then the slurm controller... you are making a 
>> huge issue of a very basic task...
>>
>> Sid
>>
>>
>> On Tue, 4 May 2021, 22:28 Tina Friedrich <tina.friedrich at it.ox.ac.uk> wrote:
>>
>>     Hello,
>>
>>     a lot of people already gave very good answers on how to tackle this.
>>
>>     Still, I thought it worth pointing this out - you said 'you need to
>>     basically shut down slurm, update the slurm.conf file, then restart'.
>>     That makes it sound like a major operation with lots of prep required.
>>
>>     It's not like that at all. Updating slurm.conf is not a major operation.
>>
>>     There's absolutely no reason to shut things down first & then change
>>     the file. You can edit the file / ship out a new version (however you
>>     like) and then restart the daemons.
>>
>>     The daemons do not have to all be restarted simultaneously. It is of
>>     no consequence if they're running with out-of-sync config files for a
>>     bit, really. (There's a flag you can set if you want to suppress the
>>     warning - 'NO_CONF_HASH' debug flag I think).
>>
>>     Restarting the daemons (slurmctld, slurmd, ...) is safe. It does not
>>     require cluster downtime or anything.
>>
>>     I control slurm.conf using configuration management; the config
>>     management process restarts the appropriate daemon (slurmctld,
>>     slurmd, slurmdbd) if the file changed. This certainly never happens
>>     at the same time; there's splay in that. It doesn't even necessarily
>>     happen on the controller first, or anything like that.
>>
>>     What I'm trying to get across - I have a feeling this 'updating the
>>     cluster wide config file' and 'file must be the same on all nodes' is
>>     a lot less of a procedure (and a lot less strict) than you currently
>>     imagine it to be :)
>>
>>     Tina
>>
>>     On 27/04/2021 19:35, David Henkemeyer wrote:
>>      > Hello,
>>      >
>>      > I'm new to Slurm (coming from PBS), and so I will likely have a few
>>      > questions over the next several weeks, as I work to transition my
>>      > infrastructure from PBS to Slurm.
>>      >
>>      > My first question has to do with *_adding nodes to Slurm_*. According
>>      > to the FAQ (and other articles I've read), you need to basically shut
>>      > down slurm, update the slurm.conf file /*on all nodes in the cluster*/,
>>      > then restart slurm.
>>      >
>>      > - Why do all nodes need to know about all other nodes?  From what I
>>      > have read, Slurm does a checksum comparison of the slurm.conf file
>>      > across all nodes.  Is this the only reason all nodes need to know
>>      > about all other nodes?
>>      > - Can I create a symlink that points <sysconfdir>/slurm.conf to a
>>      > slurm.conf file on an NFS mount point, which is mounted on all the
>>      > nodes?  This way, I would only need to update a single file, then
>>      > restart Slurm across the entire cluster.
>>      > - Any additional help/resources for adding/removing nodes to Slurm
>>      > would be much appreciated.  Perhaps there is a "toolkit" out there to
>>      > automate some of these operations (which is what I already have for
>>      > PBS, and will create for Slurm, if something doesn't already exist).
>>      >
>>      > Thank you all,
>>      >
>>      > David
>>
>>     --
>>     Tina Friedrich, Advanced Research Computing Snr HPC Systems
>>     Administrator
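
For illustration, the configuration-management pattern described in the
thread (restart the local daemon only when slurm.conf has actually changed)
could be sketched roughly as below. This is a minimal sketch under stated
assumptions, not Slurm's own config hash check and not anyone's actual
setup: the stamp file, the function names, and the assumption that the
daemons run as systemd units named after the daemon (slurmd, slurmctld) are
all hypothetical. The code only builds the restart commands; running them is
left to the caller.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def config_changed(conf: Path, stamp: Path) -> bool:
    """True if conf differs from the digest recorded in stamp.

    A missing stamp file counts as "changed", so the first run plans a restart.
    """
    previous = stamp.read_text().strip() if stamp.exists() else None
    return file_digest(conf) != previous


def restart_commands(daemons: list[str]) -> list[list[str]]:
    """Build (but do not run) one restart command per local daemon.

    Assumption: each daemon is a systemd unit with the same name.
    """
    return [["systemctl", "restart", name] for name in daemons]


def plan_restart(conf: Path, stamp: Path, daemons: list[str]) -> list[list[str]]:
    """Return the restart commands needed on this host, or [] if none.

    The new digest is recorded when a restart is planned, so an unchanged
    file does not trigger another restart on the next run.
    """
    if not config_changed(conf, stamp):
        return []
    stamp.write_text(file_digest(conf) + "\n")
    return restart_commands(daemons)
```

On a compute node this might be called as plan_restart(Path(conf_path),
Path(stamp_path), ["slurmd"]), with both paths chosen by the site, and each
returned command handed to subprocess.run(cmd, check=True). Because every
host decides on its own, the restarts naturally happen at different times,
which matches the point made above that the daemons do not need to be
restarted simultaneously.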