[slurm-users] cluster reconfigure
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue Jun 9 12:28:47 UTC 2020
On 6/9/20 12:12 PM, Steve Brasier wrote:
> Hi all, looking for some advice on the process to follow when doing one
> of the reconfigurations which requires a slurm daemon restart (as listed
> in docs for "scontrol reconfigure").
When reconfiguring slurm.conf, make sure to propagate that file to all
nodes first!
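One way to check that the file really is identical everywhere is to compare checksums. This is only a minimal sketch: find_stale_nodes is a made-up helper name (not a Slurm command), and the path /etc/slurm/slurm.conf, the nodes.txt list and passwordless ssh in the usage comment are assumptions about your site.

```shell
# Reads lines of the form "<node> <checksum>" on stdin and prints every node
# whose checksum differs from the reference checksum given as $1.
# find_stale_nodes is a hypothetical helper, not part of Slurm.
find_stale_nodes() {
    ref=$1
    while read -r node sum; do
        [ -n "$node" ] || continue            # skip empty lines
        [ "$sum" = "$ref" ] || printf '%s\n' "$node"
    done
}

# Possible usage (assumed path, assumed nodes.txt with one hostname per line):
#   ref=$(sha256sum /etc/slurm/slurm.conf | cut -d' ' -f1)
#   while read -r n; do
#       printf '%s %s\n' "$n" \
#           "$(ssh -n "$n" sha256sum /etc/slurm/slurm.conf | cut -d' ' -f1)"
#   done < nodes.txt | find_stale_nodes "$ref"
```

Any node name printed still has a different copy of the file and should be updated before the reconfigure.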
The scontrol manual page explains when a restart of the daemons (and not
just "scontrol reconfig") is required:
reconfigure
Instruct all Slurm daemons to re-read the configuration file. This
command does not restart the daemons. This mechanism would be used to
modify configuration parameters (Epilog, Prolog, SlurmctldLogFile,
SlurmdLogFile, etc.). The Slurm controller (slurmctld) forwards the
request to all other daemons (slurmd daemon on each compute node). Running
jobs continue execution. Most configuration parameters can be changed by
just running this command, however, Slurm daemons should be shutdown and
restarted if any of these parameters are to be changed: AuthType,
ControlMach, PluginDir, StateSaveLocation, SlurmctldHost, SlurmctldPort,
or SlurmdPort. The slurmctld daemon and all slurmd daemons must be
restarted if nodes are added to or removed from the cluster.
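To decide between "scontrol reconfig" and a restart, one could compare the old and the new slurm.conf for the parameters listed in the excerpt. The following is a rough sketch under assumptions: restart_params_changed is a hypothetical helper, "ControlMach" is taken to mean the slurm.conf parameter ControlMachine, and the check does not detect added or removed nodes, which also require a restart.

```shell
# Prints the name of each restart-requiring parameter whose setting differs
# between two slurm.conf files ($1 = old file, $2 = new file).
# restart_params_changed is a hypothetical helper, not part of Slurm.
restart_params_changed() {
    old=$1
    new=$2
    for p in AuthType ControlMachine PluginDir StateSaveLocation \
             SlurmctldHost SlurmctldPort SlurmdPort; do
        # Collect the uncommented "Param=value" lines, ignoring blanks/tabs
        # so that "X = 1" and "X=1" compare as equal.
        a=$(grep -i "^[[:space:]]*$p[[:space:]]*=" "$old" | tr -d ' \t' | sort)
        b=$(grep -i "^[[:space:]]*$p[[:space:]]*=" "$new" | tr -d ' \t' | sort)
        [ "$a" = "$b" ] || printf '%s\n' "$p"
    done
}

# Possible usage (file names are placeholders):
#   restart_params_changed slurm.conf.old slurm.conf.new
```

Empty output suggests that none of the listed parameters changed; it is not a guarantee that "scontrol reconfig" is sufficient.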
> In this situation, is there any difference in terms of preservation of
> slurm's state etc between using "scontrol shutdown" or running "service
> slurmd/slurmctld stop" on each node?
The slurmctld state is preserved in the server's StateSaveLocation:
# scontrol show config | grep StateSaveLocation
StateSaveLocation = /var/spool/slurmctld
It is essential not to disturb that folder! Make a backup after stopping
slurmctld, just in case...
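A backup could be as simple as a timestamped tarball. This is a sketch only: backup_state_dir is a hypothetical helper, and the destination /root as well as the systemd unit name in the usage comment are assumptions.

```shell
# Archives the directory $1 into a timestamped tarball below $2 and prints
# the path of the archive. backup_state_dir is a hypothetical helper.
backup_state_dir() {
    src=$1
    dest=$2
    out="$dest/statesave-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C keeps the archive paths relative to the parent of the state directory
    tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")" &&
        printf '%s\n' "$out"
}

# Possible usage on the controller, as root, after slurmctld has stopped
# (assumes a systemd-managed slurmctld):
#   systemctl stop slurmctld
#   backup_state_dir /var/spool/slurmctld /root
```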
> Is there a recommended order in which to shutdown and restart daemons?
Why do you want to shut down/restart in the first place? I think you can
restart any daemon if necessary, but you have to consider Slurm's timeout
parameters SlurmctldTimeout and SlurmdTimeout:
# scontrol show config | grep Timeout
If any daemon is down for longer than its timeout, things will start failing!
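If you want just the two values in a script, something like the following might do. It is a sketch that assumes the "Name = value" layout shown for StateSaveLocation above; show_daemon_timeouts is a hypothetical helper name.

```shell
# Reads "scontrol show config" output on stdin and prints "Name=value" for
# SlurmctldTimeout and SlurmdTimeout. Assumes lines look like
# "SlurmdTimeout = 300 sec", so the value is the third field.
show_daemon_timeouts() {
    awk '$1 == "SlurmctldTimeout" || $1 == "SlurmdTimeout" { print $1 "=" $3 }'
}

# Possible usage:
#   scontrol show config | show_daemon_timeouts
```

Keeping a planned restart well inside these intervals should avoid the failures mentioned above.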
Best regards,
Ole