[slurm-users] cluster reconfigure

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Tue Jun 9 12:28:47 UTC 2020


On 6/9/20 12:12 PM, Steve Brasier wrote:
> Hi all, looking for some advice on the process to follow when doing one 
> of the reconfigurations that require a Slurm daemon restart (as listed 
> in docs for "scontrol reconfigure").

When reconfiguring slurm.conf, make sure to propagate that file to all 
nodes first!
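
A minimal sketch of that propagation step, assuming passwordless scp as 
root and the same slurm.conf path on every node. The path and node names 
are hypothetical placeholders; sites with a shared filesystem or a 
configuration management tool would do this differently:

```shell
# Hypothetical values: adjust the path and node list for your site.
CONF=${CONF:-/etc/slurm/slurm.conf}
NODES=${NODES:-"node001 node002 node003"}

# Copy the controller's slurm.conf to every node and report failures.
# Returns non-zero if any copy failed.
push_conf() {
    failed=0
    for n in $NODES; do
        if ! scp -q "$CONF" "$n:$CONF"; then
            echo "copy to $n failed" >&2
            failed=1
        fi
    done
    return $failed
}

# Only after every copy succeeded, ask the daemons to re-read the file:
#   push_conf && scontrol reconfigure
```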

The scontrol manual page explains when a restart of the daemons (and not 
just "scontrol reconfig") is required:

reconfigure
     Instruct all Slurm daemons to re-read the configuration file. This 
command does not restart the daemons. This mechanism would be used to 
modify configuration parameters (Epilog, Prolog, SlurmctldLogFile, 
SlurmdLogFile, etc.). The Slurm controller (slurmctld) forwards the 
request to all other daemons (slurmd daemon on each compute node). Running 
jobs continue execution. Most configuration parameters can be changed by 
just running this command; however, Slurm daemons should be shut down and 
restarted if any of these parameters are to be changed: AuthType, 
ControlMach, PluginDir, StateSaveLocation, SlurmctldHost, SlurmctldPort, 
or SlurmdPort. The slurmctld daemon and all slurmd daemons must be 
restarted if nodes are added to or removed from the cluster.
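
As a rough illustration of that rule, the sketch below compares an old and 
a new slurm.conf and reports whether any of the parameters listed above 
differ. The function names are made up for this example. It is deliberately 
conservative: any edit to a NodeName line counts as a node change, even if 
no node was actually added or removed:

```shell
# Parameters the man page excerpt above lists as needing a daemon restart.
# "ControlMach[a-z]*" also matches the full spelling ControlMachine.
RESTART_KEYS='AuthType|ControlMach[a-z]*|PluginDir|StateSaveLocation|SlurmctldHost|SlurmctldPort|SlurmdPort|NodeName'

# Print the restart-sensitive lines of a slurm.conf, sorted.
# Commented-out lines do not match because of the leading anchor.
restart_lines() {
    grep -Ei "^[[:space:]]*($RESTART_KEYS)[[:space:]]*=" "$1" | sort
}

# Print "restart" if a sensitive line differs between the two files,
# otherwise "reconfigure" (meaning "scontrol reconfigure" should suffice).
advise() {
    if [ "$(restart_lines "$1")" != "$(restart_lines "$2")" ]; then
        echo "restart"
    else
        echo "reconfigure"
    fi
}

# Usage: advise /etc/slurm/slurm.conf.old /etc/slurm/slurm.conf
```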


> In this situation, is there any difference in terms of preservation of 
> Slurm's state etc. between using "scontrol shutdown" and running "service 
> slurmd/slurmctld stop" on each node?

The slurmctld state is preserved in the server's StateSaveLocation:

# scontrol show config | grep StateSaveLocation
StateSaveLocation       = /var/spool/slurmctld

It is essential not to disturb that folder!  Make a backup after stopping 
slurmctld, just in case...
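
One way such a backup could look, as a sketch only: the helper names, the 
/root destination and the use of systemctl are assumptions for this 
example, not something Slurm prescribes:

```shell
# Extract the StateSaveLocation path from "scontrol show config" output
# read on stdin (a line of the form "StateSaveLocation = /path").
state_dir() {
    awk '$1 == "StateSaveLocation" { print $3 }'
}

# Archive directory $1 into a timestamped tarball under directory $2
# and print the tarball's path.
backup_state() {
    src=$1
    dest=$2/slurmctld-state-$(date +%Y%m%d-%H%M%S).tar.gz
    tar -czf "$dest" -C "$(dirname "$src")" "$(basename "$src")" && echo "$dest"
}

# Possible use on the controller, as root:
#   dir=$(scontrol show config | state_dir)
#   systemctl stop slurmctld
#   backup_state "$dir" /root
#   systemctl start slurmctld
```

The directory is looked up before stopping slurmctld because scontrol needs 
a running controller to answer.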

> Is there a recommended order in which to shut down and restart daemons?

Why do you want to shut down and restart in the first place?  I think you 
can restart any daemon if necessary, but you have to consider Slurm's 
timeout parameters SlurmctldTimeout and SlurmdTimeout:

# scontrol show config | grep Timeout

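If a daemon stays down longer than its timeout, things will start failing: 
slurmctld sets a node DOWN when its slurmd has not responded within 
SlurmdTimeout, and a backup controller (if one is configured) takes over 
when the primary has not responded within SlurmctldTimeout.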
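
To make that check concrete, here is a small sketch that pulls a timeout 
out of the "scontrol show config" output and compares it with a planned 
downtime. It assumes the value is printed in the form 
"SlurmdTimeout = 300 sec"; the helper names and the 120-second downtime are 
made up for illustration:

```shell
# Print the numeric value of timeout parameter $1 from
# "scontrol show config" output read on stdin.
timeout_secs() {
    awk -v key="$1" '$1 == key { print $3 }'
}

# Compare a planned downtime $1 (seconds) with a timeout $2 (seconds).
check_downtime() {
    if [ "$1" -lt "$2" ]; then
        echo "ok: ${1}s is within the ${2}s timeout"
    else
        echo "risk: ${1}s reaches the ${2}s timeout"
    fi
}

# Possible use before restarting slurmd on the compute nodes:
#   t=$(scontrol show config | timeout_secs SlurmdTimeout)
#   check_downtime 120 "$t"
```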

Best regards,
Ole


