<div dir="ltr"><div>Will, I know I will regret chiming in here. Are you able to say what cluster manager or framework you are using?</div><div>I don't see a problem in running two different distributions. But as Per says look at your development environment.</div><div><br></div><div>For my part, I would ask have you thought about containerisation? ie CentOS comoute nodes and run Singularity?</div><div><br></div><div>ALso the 'unique home directory per node' gives me the heebie-jeebies. I guess technically is it OK.</div><div>However many commercial packages crate dot files or dit directories in user home directories.</div><div>I am thinking of things like Ansys and Matlab etc. etc. etc. here</div><div>What will you do if these dotfiles are not consistent across the cluster?</div><div><br></div><div>Before anyoen says it, I was arguing somewhere else recently that 'home directories' are an outdated concept when you are running HPC.</div><div>I still think that, and this is a classic case in point.</div><div>Forgive me if I have misunderstood your setup.</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 25 May 2018 at 11:30, Pär Lindfors <span dir="ltr"><<a href="mailto:paran@nsc.liu.se" target="_blank">paran@nsc.liu.se</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Will,<br>
<span class=""><br>
On 05/24/2018 05:43 PM, Will Dennis wrote:<br>
> (we were using CentOS 7.x<br>
> originally, now the compute nodes are on Ubuntu 16.04.) Currently, we<br>
> have a single controller (slurmctld) node, an accounting db node> (slurmdbd), and 10 compute/worker nodes (slurmd.)<br>
<br>
</span>Time to start upgrading to Ubuntu 18.04 now then? :-)<br>
<br>
For a 10 node cluster it might make more sense to run slurmctld and<br>
slurmdbd on the same hardware as neither have very high hardware<br>
requirements.<br>
<br>
On our current clusters we run both services on the same machine. The<br>
main disadvantage of this is that it makes upgrades inconvenient as it<br>
prevents upgrading slurmdbd and slurmctld independently. For future<br>
installations we will probably try running slurmdbd in a VM.<br>
<span class=""><br>
> The problem is that the controller is still running CentOS 7 with our<br>
> older NFS-mounted /home scheme, but the compute nodes are now all Ubuntu<br>
> 16.04 with local /home fs’s.<br>
<br>
</span>Does each user have a different local home directory on each compute<br>
node? That is not something I would recommend, unless you are very good<br>
at training your users to avoid submitting jobs in their home<br>
directories. I assume you have some other shared file system across the<br>
cluster?<br>
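If you do keep per-node local home directories, it is worth checking now and then that they have not drifted apart. A rough sketch only, assuming passwordless ssh from the admin host, with made-up node names and dotfile paths:

#!/usr/bin/env python3
"""Rough consistency check for per-node home directories.

A sketch, not an endorsement of the setup: it assumes passwordless ssh
to every node, and the node names and dotfile paths below are made up.
It compares checksums of a few dotfiles so drifting nodes stand out.
"""
import subprocess

NODES = ["node01", "node02", "node03"]                 # hypothetical node names
DOTFILES = ["~/.bashrc", "~/.profile", "~/.condarc"]   # examples only

for path in DOTFILES:
    sums = {}
    for node in NODES:
        # md5sum fails if the file is missing; turn that into a marker value
        result = subprocess.run(
            ["ssh", node, f"md5sum {path} 2>/dev/null || echo MISSING"],
            capture_output=True, text=True)
        sums[node] = result.stdout.split()[0] if result.stdout else "ERROR"
    if len(set(sums.values())) > 1:
        print(f"INCONSISTENT {path}: {sums}")
    else:
        print(f"ok           {path}")

Anything it flags as inconsistent is a good candidate for living on the shared file system instead.
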
<span class=""><br>
> 1) Can we leave the current controller machine on C7 OS, and just<br>
> have the users log into other machines (that have the same config as the<br>
> compute nodes) to submit jobs?<br>
> Or should the controller node really be<br>
> on the same OS as the compute nodes for some reason?<br>
<br>
</span>I recommend separating them, for systems administration and user<br>
convenience reasons.<br>
<br>
With users logged into the the same machine that is running your<br>
controller or other cluster services, the users can impact the operation<br>
of the entire cluster when they make mistakes. Typical user mistakes<br>
involves using all CPU resources, using all memory, filling up or<br>
overloading filesystems... Much better to have this happen on dedicated<br>
login machines.<br>
<br>
If the login machine uses a different OS than the worker nodes, users<br>
will also run into problems if they compile software there, as system<br>
library versions won't match what is available on the compute nodes.<br>
<br>
Technically as long as you use the same Slurm version it should work.<br>
You should however check that your Slurm binaries on different OS are<br>
build with the exact same features enabled. Many are enabled at compile<br>
time, so check and compare the output from ./configure.<br>
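If it helps, something like the following can diff the feature checks from two saved configure runs. The log file names are made up; it assumes you captured the configure output on each build host, e.g. with tee:

#!/usr/bin/env python3
"""Compare the feature checks from two saved ./configure runs.

A sketch only: it assumes the configure output from each build host was
saved into the two (hypothetical) files named below.
"""
import re

FILES = ["configure-centos7.log", "configure-ubuntu1604.log"]  # assumed names

def feature_checks(path):
    """Return {check description: result} for lines like 'checking for X... yes'."""
    checks = {}
    pattern = re.compile(r"^checking (.+?)\.\.\. (.+)$")
    with open(path) as fh:
        for line in fh:
            m = pattern.match(line.strip())
            if m:
                checks[m.group(1)] = m.group(2)
    return checks

a, b = (feature_checks(f) for f in FILES)
for key in sorted(set(a) | set(b)):
    if a.get(key) != b.get(key):
        print(f"{key}: {a.get(key, 'absent')} vs {b.get(key, 'absent')}")
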
> 2) Can I add a backup controller node that runs a different...
> 3) What are the steps to replace a primary controller, given that a
...
We are not currently using a backup controller, so I can't answer that part.

slurmctld keeps its state files in the directory configured as
StateSaveLocation, so for slurmctld you typically only need to save the
configuration files and that directory. (Note this does not include
munge or the slurmdbd.)
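For what it is worth, here is a minimal sketch of that kind of backup. It assumes slurm.conf lives in /etc/slurm (adjust to your installation) and simply reads StateSaveLocation from the config, then tars it up together with the config directory:

#!/usr/bin/env python3
"""Back up the slurmctld state directory plus the Slurm config files.

A sketch under assumptions: slurm.conf is in /etc/slurm and a plain
tarball is an acceptable backup target. Munge keys and the slurmdbd
database need their own backups, as noted above.
"""
import re
import tarfile
import time

SLURM_CONF = "/etc/slurm/slurm.conf"   # assumed location
CONFIG_DIR = "/etc/slurm"

def state_save_location(conf_path):
    """Pull StateSaveLocation out of slurm.conf, ignoring comments."""
    with open(conf_path) as fh:
        for line in fh:
            line = line.split("#", 1)[0].strip()
            m = re.match(r"StateSaveLocation\s*=\s*(\S+)", line, re.IGNORECASE)
            if m:
                return m.group(1)
    raise RuntimeError("StateSaveLocation not found in " + conf_path)

state_dir = state_save_location(SLURM_CONF)
backup = f"/root/slurmctld-backup-{time.strftime('%Y%m%d')}.tar.gz"
with tarfile.open(backup, "w:gz") as tar:
    tar.add(CONFIG_DIR)   # slurm.conf, gres.conf, etc.
    tar.add(state_dir)    # StateSaveLocation contents
print("Wrote", backup)

Run it on the machine hosting slurmctld, and keep separate backups for munge and the slurmdbd database.
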
Regards,
Pär Lindfors, NSC