[slurm-users] Controller / backup controller q's

Thu May 24 09:43:34 MDT 2018

Hi all,

We are building out a new Slurm cluster for a research group here. Unfortunately, this has taken place over a long period of time, and there have been some architectural changes made in the middle, most importantly to the host OS on the Slurm nodes (we were using CentOS 7.x originally; the compute nodes are now on Ubuntu 16.04). Currently, we have a single controller (slurmctld) node, an accounting DB node (slurmdbd), and 10 compute/worker nodes (slurmd).

The problem is that the controller is still running CentOS 7 with our older NFS-mounted /home scheme, but the compute nodes are now all on Ubuntu 16.04 with local /home filesystems. Currently (we are still in testing mode here), the users log in to the controller node to submit jobs, but of course that is now a completely different OS/library environment from the one on the compute nodes. (They cannot log in to the compute nodes unless they have a job currently running on them, as we have implemented the 'pam_slurm.so' PAM module on the compute nodes.)

My questions are these:

1) Can we leave the current controller machine on CentOS 7 and just have the users log in to other machines (that have the same config as the compute nodes) to submit jobs? Or should the controller node really be on the same OS as the compute nodes for some reason?

2) Can I add a backup controller node that runs a different environment from the primary controller node (i.e., the compute node environment)? Or should (must) it be the same as the primary controller node?

3) What are the steps to replace a primary controller, given that a backup controller exists? (Hopefully this is already documented somewhere that I haven't found yet.)

Thanks,
Will