<div dir="ltr"><div dir="ltr"><div>Hi,</div><div><br></div><div>If it helps, this is our setup:</div><div>6 clusters (actually a bit more)<br></div><div>1 mysql + slurmdbd on the same host </div><div>6 primary slurmctld on 3 hosts (need to make sure each have a distinct SlurmctldPort)</div><div>6 secondary slurmctld on an arbitrary node on the clusters themselves.<br></div><div>1 login node per cluster (this is a very small VM, and the users are limited both to cpu time (with ulimit) and memory (with systemd))</div><div>The slurm.conf's are shared on nfs to everyone in /path/to/nfs/<cluster name>/slurm.conf. With symlink to /etc for the relevant cluster per node.<br></div><div><br></div><div>The -M generally works, we can submit/query jobs from a login node of one cluster to another. But there's a caveat to notice when upgrading. slurmdbd must be upgraded first, but usually we have a not so small gap between upgrading the different clusters. This causes the -M to stop working because binaries of one version won't work on the other (I don't remember in which direction).</div><div>We solved this by using an lmod module per cluster, which both sets the SLURM_CONF environment, and the PATH to the correct slurm binaries (which we install in /usr/local/slurm/<version>/ so that they co-exists). So when the -M won't work, users can use:</div><div>module load slurm/clusterA</div><div>squeue</div><div>module load slurm/clusterB</div><div>squeue</div><div><br></div><div>BR,<br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Oct 28, 2021 at 7:39 PM navin srivastava <<a href="mailto:navin.altair@gmail.com">navin.altair@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto">Thank you Tina. <div dir="auto">It will really help</div><div dir="auto"><br></div><div dir="auto">Regards </div><div dir="auto">Navin </div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Oct 28, 2021, 22:01 Tina Friedrich <<a href="mailto:tina.friedrich@it.ox.ac.uk" target="_blank">tina.friedrich@it.ox.ac.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello,<br>
<div><br></div><div>BR,<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Oct 28, 2021 at 7:39 PM navin srivastava <<a href="mailto:navin.altair@gmail.com">navin.altair@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto">Thank you Tina. <div dir="auto">It will really help.</div><div dir="auto"><br></div><div dir="auto">Regards </div><div dir="auto">Navin </div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Oct 28, 2021, 22:01 Tina Friedrich <<a href="mailto:tina.friedrich@it.ox.ac.uk" target="_blank">tina.friedrich@it.ox.ac.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello,<br>
<br>
I have the database on a separate server (it runs the database and the <br>
database only). The login nodes run nothing SLURM related, they simply <br>
have the binaries installed & a SLURM config.<br>
<br>
I've never looked into having multiple databases & using <br>
AccountingStorageExternalHost (in fact I'd forgotten you could do that), <br>
so I can't comment on that (maybe someone else can); I think that works, <br>
yes, but as I said, I never tested it (I didn't see much point in running <br>
multiple databases if one would do the job).<br>
<br>
I actually have specific login nodes for both of my clusters, to make it <br>
easier for users (especially those without much experience of the <br>
HPC environment); so I have one login node connecting to cluster 1 and <br>
one connecting to cluster 2.<br>
<br>
I think the relevant bits of slurm.conf on the login nodes (if I'm <br>
not mistaken) are the ones below.<br>
<br>
The differences between the two config files (that haven't got to do <br>
with topology & nodes & scheduler tuning) are:<br>
<br>
ClusterName=cluster1<br>
ControlMachine=cluster1-slurm<br>
ControlAddr=/IP_OF_SLURM_CONTROLLER/<br>
<br>
ClusterName=cluster2<br>
ControlMachine=cluster2-slurm<br>
ControlAddr=/IP_OF_SLURM_CONTROLLER/<br>
<br>
(where IP_OF_SLURM_CONTROLLER is the IP address of host cluster1-slurm, <br>
same for cluster2)<br>
<br>
And then they have these common entries for the accounting storage:<br>
<br>
AccountingStorageHost=slurm-db-prod<br>
AccountingStorageBackupHost=slurm-db-prod<br>
AccountingStoragePort=7030<br>
AccountingStorageType=accounting_storage/slurmdbd<br>
<br>
(slurm-db-prod is simply the hostname of the SLURM database server)<br>
<br>
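One thing to add (from memory, so please double-check against the <br>
accounting documentation): each cluster also has to be registered in <br>
that shared database before '-M' can find it, i.e. something like<br>
<br>
sacctmgr add cluster cluster1<br>
sacctmgr add cluster cluster2<br>
sacctmgr list clusters<br>
<br>
where the last command just verifies that both are registered.<br>
<br>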
Does that help?<br>
<br>
Tina<br>
<br>
On 28/10/2021 14:59, navin srivastava wrote:<br>
> Thank you Tina.<br>
> <br>
> So if I understood correctly, is the database global to both clusters <br>
> and running on the login node?<br>
> Or is the database running on one of the master nodes and shared with <br>
> the other master node?<br>
> <br>
> But as far as I have read, the slurm database can also be separate on <br>
> each master, just using the parameter AccountingStorageExternalHost <br>
> so that both databases are aware of each other.<br>
> <br>
> Also, on the login node, which slurmctld does the slurm.conf file <br>
> point to?<br>
> Is it possible to share a sample slurm.conf file from a login node?<br>
> <br>
> Regards<br>
> Navin.<br>
> <br>
> <br>
> <br>
> <br>
> <br>
> <br>
> <br>
> <br>
> On Thu, Oct 28, 2021 at 7:06 PM Tina Friedrich <br>
> <<a href="mailto:tina.friedrich@it.ox.ac.uk" rel="noreferrer" target="_blank">tina.friedrich@it.ox.ac.uk</a> <mailto:<a href="mailto:tina.friedrich@it.ox.ac.uk" rel="noreferrer" target="_blank">tina.friedrich@it.ox.ac.uk</a>>> wrote:<br>
> <br>
> Hi Navin,<br>
> <br>
> Well, I have two clusters & login nodes that allow access to both. Will<br>
> that do? I don't think a third cluster would make any difference to the<br>
> setup.<br>
> <br>
> They need to share a database. As long as they share a database, the<br>
> clusters have 'knowledge' of each other.<br>
> <br>
> So if you set up one database server (running slurmdbd), and then a<br>
> SLURM controller for each cluster (running slurmctld) using that one<br>
> central database, the '-M' option should work.<br>
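> <br>
> For example (assuming the clusters are named 'cluster1' and 'cluster2'<br>
> in the database), a user on either login node could then run something<br>
> like:<br>
> <br>
> squeue -M cluster1,cluster2   # queues on both clusters<br>
> sinfo -M cluster2             # partitions/nodes on the other cluster<br>
> sbatch -M cluster2 job.sh     # submit to the other cluster<br>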
> <br>
> Tina<br>
> <br>
> On 28/10/2021 10:54, navin srivastava wrote:<br>
> > Hi,<br>
> ><br>
> > I am looking for a stepwise guide to setting up a multi-cluster<br>
> > implementation. We want to set up 3 clusters and one login node,<br>
> > and run jobs using the -M cluster option.<br>
> > Does anybody have such a setup who can share some insight into how<br>
> > it works, and whether it is really a stable solution?<br>
> ><br>
> ><br>
> > Regards<br>
> > Navin.<br>
> <br>
> -- <br>
> Tina Friedrich, Advanced Research Computing Snr HPC Systems<br>
> Administrator<br>
> <br>
> Research Computing and Support Services<br>
> IT Services, University of Oxford<br>
> <a href="http://www.arc.ox.ac.uk" rel="noreferrer noreferrer" target="_blank">http://www.arc.ox.ac.uk</a> <<a href="http://www.arc.ox.ac.uk" rel="noreferrer noreferrer" target="_blank">http://www.arc.ox.ac.uk</a>><br>
> <a href="http://www.it.ox.ac.uk" rel="noreferrer noreferrer" target="_blank">http://www.it.ox.ac.uk</a> <<a href="http://www.it.ox.ac.uk" rel="noreferrer noreferrer" target="_blank">http://www.it.ox.ac.uk</a>><br>
> <br>
<br>
-- <br>
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator<br>
<br>
Research Computing and Support Services<br>
IT Services, University of Oxford<br>
<a href="http://www.arc.ox.ac.uk" rel="noreferrer noreferrer" target="_blank">http://www.arc.ox.ac.uk</a> <a href="http://www.it.ox.ac.uk" rel="noreferrer noreferrer" target="_blank">http://www.it.ox.ac.uk</a><br>
<br>
</blockquote></div>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">
<div>
<pre style="font-family:monospace"> <span style="color:rgb(133,12,27)">/|</span> |
<span style="color:rgb(133,12,27)">\/</span> | <span style="color:rgb(51,88,104);font-weight:bold">Yair Yarom </span><span style="color:rgb(51,88,104)">| System Group (DevOps)</span>
<span style="color:rgb(92,181,149)">[]</span> | <span style="color:rgb(51,88,104);font-weight:bold">The Rachel and Selim Benin School</span>
<span style="color:rgb(92,181,149)">[]</span> <span style="color:rgb(133,12,27)">/\</span> | <span style="color:rgb(51,88,104);font-weight:bold">of Computer Science and Engineering</span>
<span style="color:rgb(92,181,149)">[]</span><span style="color:rgb(0,161,146)">//</span><span style="color:rgb(133,12,27)">\</span><span style="color:rgb(133,12,27)">\</span><span style="color:rgb(49,154,184)">/</span> | <span style="color:rgb(51,88,104)">The Hebrew University of Jerusalem</span>
<span style="color:rgb(92,181,149)">[</span><span style="color:rgb(1,84,76)">/</span><span style="color:rgb(0,161,146)">/</span> <span style="color:rgb(41,16,22)">\</span><span style="color:rgb(41,16,22)">\</span> | <span style="color:rgb(51,88,104)">T +972-2-5494522 | F +972-2-5494522</span>
<span style="color:rgb(1,84,76)">//</span> <span style="color:rgb(21,122,134)">\</span> | <span style="color:rgb(51,88,104)"><a href="mailto:irush@cs.huji.ac.il" target="_blank">irush@cs.huji.ac.il</a></span>
<span style="color:rgb(127,130,103)">/</span><span style="color:rgb(1,84,76)">/</span> |
</pre>
</div>
</div></div></div>