[slurm-users] Controller / backup controller q's

Will Dennis wdennis at nec-labs.com
Fri May 25 10:09:36 MDT 2018


No cluster mgr/framework in use... We custom-compiled and packaged the Slurm 16.05.4 release into .rpm/.deb files, and used those to install Slurm on the different nodes.

Although the homedirs are no longer shared, the nodes do have access to shared storage, one share of which is mounted as a subdir of the home directory (you can symlink items from that share up to the homedir level “auto-magically” via a conf file that works with a system we designed.) So shared dotfiles, subdirs/files in the homedir, etc. are all possible.
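
Conceptually the symlinking amounts to something like the below (paths are purely illustrative; the real thing is driven by the conf file / tooling we designed):

    # shared storage is mounted as (say) ~/shared on every node;
    # link the bits you want visible at the homedir level
    ln -s ~/shared/.bashrc  ~/.bashrc
    ln -s ~/shared/.ssh     ~/.ssh
    ln -s ~/shared/projects ~/projects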

Have not investigated a containerized Slurm setup – will have to put that on the exploration list. If the workloads were Dockerized, I’d probably run them via Kubernetes rather than Slurm...

-Will

From: slurm-users [mailto:slurm-users-bounces at lists.schedmd.com] On Behalf Of John Hearns
Sent: Friday, May 25, 2018 5:44 AM
To: Slurm User Community List
Subject: Re: [slurm-users] Controller / backup controller q's

Will, I know I will regret chiming in here. Are you able to say what cluster manager or framework you are using?
I don't see a problem in running two different distributions. But as Pär says, look at your development environment.

For my part, I would ask: have you thought about containerisation? I.e. CentOS compute nodes, and run Singularity?
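
Just as a sketch (the image and application names here are made up): build a CentOS 7 image once, then run it under Slurm on whatever OS the compute nodes happen to be, e.g.

    # run an application from inside a CentOS 7 Singularity image on one node
    srun -N1 singularity exec centos7-app.img ./my_solver input.dat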

Also, the 'unique home directory per node' gives me the heebie-jeebies. I guess technically it is OK.
However, many commercial packages create dot files or dot directories in user home directories.
I am thinking of things like Ansys and Matlab etc. here.
What will you do if these dotfiles are not consistent across the cluster?

Before anyone says it, I was arguing somewhere else recently that 'home directories' are an outdated concept when you are running HPC.
I still think that, and this is a classic case in point.
Forgive me if I have misunderstood your setup.

On 25 May 2018 at 11:30, Pär Lindfors <paran at nsc.liu.se> wrote:
Hi Will,

On 05/24/2018 05:43 PM, Will Dennis wrote:
> (we were using CentOS 7.x
> originally, now the compute nodes are on Ubuntu 16.04.) Currently, we
> have a single controller (slurmctld) node, an accounting db node
> (slurmdbd), and 10 compute/worker nodes (slurmd.)

Time to start upgrading to Ubuntu 18.04 now then? :-)

For a 10-node cluster it might make more sense to run slurmctld and
slurmdbd on the same hardware, as neither has very high hardware
requirements.

On our current clusters we run both services on the same machine. The
main disadvantage of this is that it makes upgrades inconvenient as it
prevents upgrading slurmdbd and slurmctld independently. For future
installations we will probably try running slurmdbd in a VM.
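
For a small cluster that really just means pointing both services at the
same host in slurm.conf, roughly like this (the hostname is only an
example):

    # slurm.conf (excerpt) -- "headnode" is a placeholder hostname
    ControlMachine=headnode
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=headnode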

> The problem is that the controller is still running CentOS 7 with our
> older NFS-mounted /home scheme, but the compute nodes are now all Ubuntu
> 16.04 with local /home fs’s.

Does each user have a different local home directory on each compute
node? That is not something I would recommend, unless you are very good
at training your users to avoid submitting jobs in their home
directories. I assume you have some other shared file system across the
cluster?

> 1)      Can we leave the current controller machine on C7 OS, and just
> have the users log into other machines (that have the same config as the
> compute nodes) to submit jobs?
> Or should the controller node really be
> on the same OS as the compute nodes for some reason?

I recommend separating them, for systems administration and user
convenience reasons.

With users logged into the same machine that is running your
controller or other cluster services, the users can impact the operation
of the entire cluster when they make mistakes. Typical user mistakes
involve using all CPU resources, using all memory, and filling up or
overloading filesystems... Much better to have this happen on dedicated
login machines.

If the login machine uses a different OS than the worker nodes, users
will also run into problems if they compile software there, as system
library versions won't match what is available on the compute nodes.

Technically, as long as you use the same Slurm version it should work.
You should however check that your Slurm binaries on the different OSes
are built with the exact same features enabled. Many are enabled at
compile time, so check and compare the output from ./configure.
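
A simple way to do that (just a sketch; the log file names are
placeholders) is to capture and diff the configure output from the two
builds:

    # on each build host, from the same Slurm 16.05.4 source tree,
    # using the same options you built the packages with
    ./configure 2>&1 | tee configure-$(hostname).log

    # then put the two logs side by side and compare what was detected
    diff configure-centos7.log configure-ubuntu1604.log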

> 2)      Can I add a backup controller node that runs a different...
> 3)      What are the steps to replace a primary controller, given that a
...
We are not currently using a backup controller, so I can't answer that part.

slurmctld keeps its state files in the directory configured as
StateSaveLocation, so for slurmctld you typically only need to save the
configuration files and that directory. (Note this does not include
munge or the slurmdbd database.)
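
As a rough example (the paths below are only examples, use whatever your
slurm.conf actually says):

    # find out where the state files live
    scontrol show config | grep -i StateSaveLocation

    # back up the config directory and the state save directory
    tar czf slurmctld-backup.tar.gz /etc/slurm /var/spool/slurmctld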

Regards,
Pär Lindfors, NSC
