[slurm-users] Controller / backup controller q's

John Hearns hearnsj at googlemail.com
Fri May 25 03:43:59 MDT 2018


Will,  I know I will regret chiming in here. Are you able to say what
cluster manager or framework you are using?
I don't see a problem in running two different distributions, but as Pär
says, look at your development environment.

For my part, I would ask: have you thought about containerisation? I.e.
CentOS compute nodes running Singularity?
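
A rough sketch of what I mean, purely illustrative (the image name and the
test command are placeholders, and the syntax is roughly Singularity 2.x):

    # On a CentOS compute node, build an Ubuntu 16.04 image from Docker Hub
    singularity build ubuntu1604.simg docker://ubuntu:16.04

    # Run a job's application inside the container; it sees an Ubuntu userland
    singularity exec ubuntu1604.simg cat /etc/os-release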

Also, the 'unique home directory per node' gives me the heebie-jeebies. I
guess technically it is OK.
However, many commercial packages create dot files or dot directories in user
home directories. I am thinking of things like Ansys and Matlab here.
What will you do if these dotfiles are not consistent across the cluster?
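
A quick way to spot-check that, assuming password-less ssh to the nodes (the
node names, user and dotfile below are only placeholders):

    # Compare one user's dotfile across compute nodes; differing checksums
    # (or missing files) mean the per-node home directories have diverged
    for n in node01 node02 node03; do
        echo -n "$n: "
        ssh "$n" md5sum /home/someuser/.bashrc 2>/dev/null || echo "missing"
    done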

Before anyone says it: I was arguing somewhere else recently that 'home
directories' are an outdated concept when you are running HPC.
I still think that, and this is a classic case in point.
Forgive me if I have misunderstood your setup.

On 25 May 2018 at 11:30, Pär Lindfors <paran at nsc.liu.se> wrote:

> Hi Will,
>
> On 05/24/2018 05:43 PM, Will Dennis wrote:
> > (we were using CentOS 7.x
> > originally, now the compute nodes are on Ubuntu 16.04.) Currently, we
> > have a single controller (slurmctld) node, an accounting db node
> > (slurmdbd), and 10 compute/worker nodes (slurmd).
>
> Time to start upgrading to Ubuntu 18.04 now then? :-)
>
> For a 10 node cluster it might make more sense to run slurmctld and
> slurmdbd on the same hardware, as neither has very high hardware
> requirements.
>
> On our current clusters we run both services on the same machine. The
> main disadvantage of this is that it makes upgrades inconvenient as it
> prevents upgrading slurmdbd and slurmctld independently. For future
> installations we will probably try running slurmdbd in a VM.
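>
> For illustration only, a minimal sketch of the relevant settings when both
> daemons live on one host (the hostname and paths are placeholders):
>
>     # slurm.conf -- controller and accounting on the same machine
>     ControlMachine=slurm-master
>     StateSaveLocation=/var/spool/slurmctld
>     AccountingStorageType=accounting_storage/slurmdbd
>     AccountingStorageHost=slurm-master
>
>     # slurmdbd.conf -- slurmdbd talking to a local MySQL/MariaDB
>     DbdHost=slurm-master
>     StorageType=accounting_storage/mysql
>     StorageHost=localhost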
>
> > The problem is that the controller is still running CentOS 7 with our
> > older NFS-mounted /home scheme, but the compute nodes are now all Ubuntu
> > 16.04 with local /home fs’s.
>
> Does each user have a different local home directory on each compute
> node? That is not something I would recommend, unless you are very good
> at training your users to avoid submitting jobs in their home
> directories. I assume you have some other shared file system across the
> cluster?
>
> > 1)      Can we leave the current controller machine on C7 OS, and just
> > have the users log into other machines (that have the same config as the
> > compute nodes) to submit jobs?
> > Or should the controller node really be
> > on the same OS as the compute nodes for some reason?
>
> I recommend separating them, for systems administration and user
> convenience reasons.
>
> With users logged into the same machine that is running your
> controller or other cluster services, the users can impact the operation
> of the entire cluster when they make mistakes. Typical user mistakes
> involve using all CPU resources, using all memory, and filling up or
> overloading filesystems... Much better to have this happen on dedicated
> login machines.
>
> If the login machine uses a different OS than the worker nodes, users
> will also run into problems if they compile software there, as system
> library versions won't match what is available on the compute nodes.
>
> Technically, as long as you use the same Slurm version it should work.
> You should however check that your Slurm binaries on the different OSes
> are built with the exact same features enabled. Many features are
> selected at compile time, so check and compare the output from ./configure.
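>
> Something along these lines, purely as an illustration (the log file names
> are placeholders):
>
>     # On each build host, capture the configure output when building Slurm
>     ./configure 2>&1 | tee configure-centos7.log      # CentOS 7 build host
>     ./configure 2>&1 | tee configure-ubuntu1604.log   # Ubuntu 16.04 build host
>
>     # Then compare the detected/enabled features side by side
>     diff configure-centos7.log configure-ubuntu1604.log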
>
> > 2)      Can I add a backup controller node that runs a different...
> > 3)      What are the steps to replace a primary controller, given that a
> ...
> We are not currently using a backup controller, so I can't answer that
> part.
>
> slurmctld keeps its state files in the directory configured as
> StateSaveLocation, so for slurmctld you typically only need to save the
> configuration files and that directory. (Note this does not include
> munge or slurmdbd.)
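>
> For example, something like this (the paths are illustrative; use whatever
> StateSaveLocation is actually set to in slurm.conf):
>
>     # Find the configured state directory
>     scontrol show config | grep -i StateSaveLocation
>
>     # Archive the Slurm configuration plus the controller state directory
>     tar czf slurmctld-backup.tar.gz /etc/slurm /var/spool/slurmctld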
>
> Regards,
> Pär Lindfors, NSC
>
>