[slurm-users] Controller / backup controller q's

Fri May 25 03:30:32 MDT 2018

Hi Will,

On 05/24/2018 05:43 PM, Will Dennis wrote:
> (we were using CentOS 7.x
> originally, now the compute nodes are on Ubuntu 16.04.) Currently, we
> have a single controller (slurmctld) node, an accounting db node
> (slurmdbd), and 10 compute/worker nodes (slurmd.)

Time to start upgrading to Ubuntu 18.04 now then? :-)

For a 10 node cluster it might make more sense to run slurmctld and
slurmdbd on the same hardware, as neither has very high hardware
requirements.

On our current clusters we run both services on the same machine. The
main disadvantage of this is that it makes upgrades inconvenient as it
prevents upgrading slurmdbd and slurmctld independently. For future
installations we will probably try running slurmdbd in a VM.

> The problem is that the controller is still running CentOS 7 with our
> older NFS-mounted /home scheme, but the compute nodes are now all Ubuntu
> 16.04 with local /home fs’s.

Does each user have a different local home directory on each compute
node? That is not something I would recommend, unless you are very good
at training your users to avoid submitting jobs in their home
directories. I assume you have some other shared file system across the
cluster?

> 1)      Can we leave the current controller machine on C7 OS, and just
> have the users log into other machines (that have the same config as the
> compute nodes) to submit jobs? Or should the controller node really be
> on the same OS as the compute nodes for some reason?

I recommend separating them, for systems administration and user
convenience reasons.

With users logged into the same machine that is running your
controller or other cluster services, the users can impact the operation
of the entire cluster when they make mistakes. Typical user mistakes
involve using all CPU resources, using all memory, filling up or
overloading filesystems... Much better to have this happen on dedicated
login machines.

If the login machine uses a different OS than the worker nodes, users
will also run into problems if they compile software there, as system
library versions won't match what is available on the compute nodes.

Technically, as long as you use the same Slurm version, it should work.
You should however check that your Slurm binaries on the different OSes
are built with the exact same features enabled. Many features are
selected at compile time, so check and compare the output from
./configure on each build.
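
One way to do that comparison is sketched below in Python. It is a
minimal sketch, assuming you saved the ./configure output from each
build to a text file and that the lines follow the usual autoconf
"checking <something>... <result>" form. The function names are
hypothetical, and the check names in the usage example are only
illustrative, not taken from a real Slurm build.

```python
import re

# Assumption: autoconf-style lines such as "checking for foo... yes".
CHECK_LINE = re.compile(r"^checking (?P<what>.+?)\.\.\.\s*(?P<result>.*)$")


def parse_configure_output(text):
    """Map each 'checking <what>... <result>' line to its result."""
    results = {}
    for line in text.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match:
            results[match.group("what")] = match.group("result").strip()
    return results


def diff_configure_outputs(text_a, text_b):
    """Return {check: (result_a, result_b)} for every check that differs.

    A check that is missing from one side shows up with None for that side.
    """
    a = parse_configure_output(text_a)
    b = parse_configure_output(text_b)
    differences = {}
    for what in sorted(set(a) | set(b)):
        if a.get(what) != b.get(what):
            differences[what] = (a.get(what), b.get(what))
    return differences


if __name__ == "__main__":
    # Illustrative input only; real output has many more lines.
    centos = "checking for munge installation... /usr\n" \
             "checking for hwloc installation... /usr\n"
    ubuntu = "checking for munge installation... /usr\n" \
             "checking for hwloc installation... no\n"
    for what, (left, right) in diff_configure_outputs(centos, ubuntu).items():
        print(f"{what}: {left!r} vs {right!r}")
```

Expect some harmless differences as well (compiler and library paths
will differ between distributions), so the result still needs a human
to read it.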

> 2)      Can I add a backup controller node that runs a different...
> 3)      What are the steps to replace a primary controller, given that a
...
We are not currently using a backup controller, so I can't answer that part.

slurmctld keeps its state files in the directory configured as
StateSaveLocation, so for slurmctld you typically only need to save the
configuration files and that directory. (Note that this does not include
munge or slurmdbd.)
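
A minimal sketch of such a backup in Python, under a few assumptions of
mine: all configuration files live in the directory that contains
slurm.conf, StateSaveLocation is set directly in that file (not in an
included file), and slurmctld is not writing to the state directory
while it is being copied. The function names and the paths in the usage
example are hypothetical.

```python
import re
import tarfile
from pathlib import Path


def find_state_save_location(conf_text):
    """Return the StateSaveLocation value from slurm.conf text, or None."""
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        match = re.match(r"(?i)^StateSaveLocation\s*=\s*(\S+)", line)
        if match:
            return match.group(1)
    return None


def backup_controller_state(conf_path, archive_path):
    """Archive the config directory and the StateSaveLocation directory.

    Does not cover munge keys or the slurmdbd database.
    """
    conf_path = Path(conf_path)
    state_dir = find_state_save_location(conf_path.read_text())
    if state_dir is None:
        raise ValueError(f"StateSaveLocation not set in {conf_path}")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(conf_path.parent)  # assumption: all config files are here
        archive.add(state_dir)
    return state_dir


if __name__ == "__main__":
    # Hypothetical paths; adjust to your installation.
    saved = backup_controller_state("/etc/slurm/slurm.conf",
                                    "/root/slurmctld-backup.tar.gz")
    print("archived config directory and", saved)
```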

Regards,
Pär Lindfors, NSC