<div dir="ltr"><div>Will, I know I will regret chiming in here. Are you able to say what cluster manager or framework you are using?</div><div>I don't see a problem in running two different distributions. But as Pär says, look at your development environment.</div><div><br></div><div>For my part, I would ask: have you thought about containerisation? i.e. CentOS compute nodes running Singularity?</div><div><br></div><div>Also, the 'unique home directory per node' gives me the heebie-jeebies. I guess technically it is OK.</div><div>However, many commercial packages create dot files or dot directories in user home directories.</div><div>I am thinking of things like Ansys and Matlab etc. here.</div><div>What will you do if these dotfiles are not consistent across the cluster?</div><div><br></div><div>Before anyone says it, I was arguing somewhere else recently that 'home directories' are an outdated concept when you are running HPC.</div><div>I still think that, and this is a classic case in point.</div><div>Forgive me if I have misunderstood your setup.</div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 25 May 2018 at 11:30, Pär Lindfors <span dir="ltr"><<a href="mailto:paran@nsc.liu.se" target="_blank">paran@nsc.liu.se</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Will,<br>

<span class=""><br>

On 05/24/2018 05:43 PM, Will Dennis wrote:<br>

> (we were using CentOS 7.x<br>

> originally, now the compute nodes are on Ubuntu 16.04.) Currently, we<br>

> have a single controller (slurmctld) node, an accounting db node<br>
> (slurmdbd), and 10 compute/worker nodes (slurmd.)<br>

<br>

</span>Time to start upgrading to Ubuntu 18.04 now then? :-)<br>

<br>

For a 10 node cluster it might make more sense to run slurmctld and<br>

slurmdbd on the same hardware as neither has very high hardware<br>

requirements.<br>

<br>

On our current clusters we run both services on the same machine. The<br>

main disadvantage of this is that it makes upgrades inconvenient as it<br>

prevents upgrading slurmdbd and slurmctld independently. For future<br>

installations we will probably try running slurmdbd in a VM.<br>

<span class=""><br>

> The problem is that the controller is still running CentOS 7 with our<br>

> older NFS-mounted /home scheme, but the compute nodes are now all Ubuntu<br>

> 16.04 with local /home fs’s.<br>

<br>

</span>Does each user have a different local home directory on each compute<br>

node? That is not something I would recommend, unless you are very good<br>

at training your users to avoid submitting jobs in their home<br>

directories. I assume you have some other shared file system across the<br>

cluster?<br>

<span class=""><br>

> 1)      Can we leave the current controller machine on C7 OS, and just<br>

> have the users log into other machines (that have the same config as the<br>

> compute nodes) to submit jobs?<br>

> Or should the controller node really be<br>

> on the same OS as the compute nodes for some reason?<br>

<br>

</span>I recommend separating them, for systems administration and user<br>

convenience reasons.<br>

<br>

With users logged into the same machine that is running your<br>

controller or other cluster services, the users can impact the operation<br>

of the entire cluster when they make mistakes. Typical user mistakes<br>

involve using all CPU resources, using all memory, filling up or<br>

overloading filesystems... Much better to have this happen on dedicated<br>

login machines.<br>

<br>

If the login machine uses a different OS than the worker nodes, users<br>

will also run into problems if they compile software there, as system<br>

library versions won't match what is available on the compute nodes.<br>

<br>

Technically as long as you use the same Slurm version it should work.<br>

You should however check that your Slurm binaries on the different OSes are<br>

built with the exact same features enabled. Many are enabled at compile<br>

time, so check and compare the output from ./configure.<br>
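<br>
A minimal sketch of that comparison, assuming you have saved the<br>
./configure output of each build as text. It relies only on the generic<br>
autoconf "checking ... result" line format; the function names are<br>
made up for illustration and are not part of Slurm.<br>

```python
def parse_checks(configure_output):
    """Map each autoconf 'checking ...' line to the result it reported."""
    results = {}
    for line in configure_output.splitlines():
        line = line.strip()
        if line.startswith("checking ") and "... " in line:
            # Split on the last "... " so the check text stays intact.
            check, _, result = line.rpartition("... ")
            results[check] = result
    return results


def diff_checks(output_a, output_b):
    """Return {check: (result_a, result_b)} for checks whose results differ.

    A check that is missing from one build is reported with None on that side.
    """
    a = parse_checks(output_a)
    b = parse_checks(output_b)
    return {
        check: (a.get(check), b.get(check))
        for check in sorted(set(a) | set(b))
        if a.get(check) != b.get(check)
    }
```

Feed it the text of the two saved outputs (one per OS); an empty result<br>
means the recorded checks agree, anything else is worth a closer look.<br>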

<br>

> 2)      Can I add a backup controller node that runs a different...<br>

<span class="">> 3)      What are the steps to replace a primary controller, given that a<br>

</span>...<br>

We are not currently using a backup controller, so I can't answer that part.<br>

<br>

slurmctld keeps its state files in the directory configured as<br>

StateSaveLocation, so for slurmctld you typically only need to save the<br>

configuration files, and that directory. (note this does not include<br>

munge or the slurmdbd)<br>
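<br>
A rough sketch of saving those two things, assuming StateSaveLocation<br>
sits on its own Key=Value line in slurm.conf. The function names and<br>
archive layout are made up for illustration; munge keys and the<br>
slurmdbd database are not covered.<br>

```python
import tarfile
from pathlib import Path


def state_save_location(conf_text):
    """Return the StateSaveLocation value from slurm.conf text, or None."""
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition("=")
        # slurm.conf keywords are case-insensitive.
        if sep and key.strip().lower() == "statesavelocation":
            return value.strip()
    return None


def archive_controller_state(conf_path, archive_path):
    """Pack slurm.conf and the state directory into one .tar.gz archive."""
    conf_path = Path(conf_path)
    state_dir = state_save_location(conf_path.read_text())
    if state_dir is None:
        raise ValueError("StateSaveLocation not set in " + str(conf_path))
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(conf_path), arcname="conf/" + conf_path.name)
        tar.add(state_dir, arcname="state")  # added recursively
    return state_dir
```

Any other configuration files you use would need to be added the same way.<br>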

<br>

Regards,<br>

Pär Lindfors, NSC<br>

<br>

</blockquote></div><br></div>