<div dir="ltr"><div><br></div><div>cpu limit using ulimit is pretty straightforward with pam_limits and /etc/security/limits.conf. On some of the login nodes we have a cpu limit of 10 minutes, so heavy processes will fail.<br></div><div><br></div><div>The memory was a bit more complicated (i.e. not pretty). We wanted that a user won't be able to use more than e.g. 1G for all processes combined. Using systemd we added the file /etc/systemd/system/user-.slice.d/20-memory.conf which contains:</div><div>[Slice]<br>MemoryLimit=1024M<br>MemoryAccounting=true</div><div><br></div><div>But we also wanted to restrict swap usage and we're still on cgroupv1, so systemd didn't help there. The ugly part comes with a pam_exec to a script that updates the memsw limit of the cgroup for the above slice. The script does more things, but the swap section is more or less:<br></div><div><br></div><div>if [ "x$PAM_TYPE" = 'xopen_session' ]; then</div><div> _id=`id -u $PAM_USER`<br> if [ -z "$_id" ]; then<br> exit 1<br> fi</div><div> if [[ -e /sys/fs/cgroup/memory/user.slice/user-${_id}.slice/memory.memsw.limit_in_bytes ]]; then</div><div> swap=$((1126 * 1024 * 1024))<br></div><div> echo $swap > /sys/fs/cgroup/memory/user.slice/user-${_id}.slice/memory.memsw.limit_in_bytes<br></div><div> fi</div><div>fi<br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Oct 31, 2021 at 6:36 PM Brian Andrus <<a href="mailto:toomuchit@gmail.com">toomuchit@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>That is interesting to me.</p>
<p>How do you use ulimit and systemd to limit user usage on the
login nodes? This sounds like something very useful.</p>
<p>Brian Andrus<br>
</p>
<div>On 10/31/2021 1:08 AM, Yair Yarom
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">
<div>Hi,</div>
<div><br>
</div>
<div>If it helps, this is our setup:</div>
<div>6 clusters (actually a bit more)<br>
</div>
<div>1 mysql + slurmdbd on the same host </div>
<div>6 primary slurmctld on 3 hosts (need to make sure each has a distinct SlurmctldPort)</div>
<div>6 secondary slurmctld, each on an arbitrary node of its own cluster.<br>
</div>
<div>1 login node per cluster (this is a very small VM, and the users are limited in both CPU time (with ulimit) and memory (with systemd))</div>
<div>The slurm.conf's are shared over NFS to everyone as /path/to/nfs/<cluster name>/slurm.conf, with a symlink from /etc to the relevant cluster's file on each node (roughly as sketched below).<br>
</div>
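<div><br></div><div>On each node that amounts to something like this (clusterA and the /etc/slurm location are just examples; adjust to wherever your build looks for slurm.conf):</div><div><pre style="font-family:monospace"># point this node at its own cluster's shared config
ln -s /path/to/nfs/clusterA/slurm.conf /etc/slurm/slurm.conf</pre></div>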
<div><br>
</div>
<div>The -M generally works, we can submit/query jobs from a
login node of one cluster to another. But there's a caveat
to notice when upgrading. slurmdbd must be upgraded first,
but usually we have a not so small gap between upgrading the
different clusters. This causes the -M to stop working
because binaries of one version won't work on the other (I
don't remember in which direction).</div>
<div>We solved this by using an lmod module per cluster, which sets both the SLURM_CONF environment variable and the PATH to the correct slurm binaries (which we install in /usr/local/slurm/<version>/ so that they co-exist). So when -M won't work, users can use:</div>
<div>module load slurm/clusterA</div>
<div>squeue</div>
<div>module load slurm/clusterB</div>
<div>squeue</div>
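<div><br></div><div>Each module effectively just does the equivalent of the following (the version number and the bin/ layout here are illustrative, not necessarily our exact tree):</div><div><pre style="font-family:monospace"># roughly what "module load slurm/clusterA" does
export SLURM_CONF=/path/to/nfs/clusterA/slurm.conf
export PATH=/usr/local/slurm/21.08/bin:$PATH</pre></div>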
<div><br>
</div>
<div>BR,<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Oct 28, 2021 at 7:39
PM navin srivastava <<a href="mailto:navin.altair@gmail.com" target="_blank">navin.altair@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="auto">Thank you Tina.
<div dir="auto">It will really help</div>
<div dir="auto"><br>
</div>
<div dir="auto">Regards </div>
<div dir="auto">Navin </div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Oct 28, 2021,
22:01 Tina Friedrich <<a href="mailto:tina.friedrich@it.ox.ac.uk" target="_blank">tina.friedrich@it.ox.ac.uk</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello,<br>
<br>
I have the database on a separate server (it runs the
database and the <br>
database only). The login nodes run nothing SLURM related; they simply <br>
have the binaries installed & a SLURM config.<br>
<br>
I've never looked into having multiple databases &
using <br>
AccountingStorageExternalHost (in fact I'd forgotten you
could do that), <br>
so I can't comment on that (maybe someone else can); I
think that works, <br>
yes, but as I said I've never tested it (didn't see much point in running <br>
multiple databases if one would do the job).<br>
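<br>
(For reference, that parameter is just a comma-separated list of the other <br>
slurmdbd hosts, set in slurm.conf roughly like the line below; the hostname <br>
and port are only placeholders.)<br>
<br>
AccountingStorageExternalHost=other-slurmdbd-host:6819<br>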
<br>
I actually have specific login nodes for both of my clusters, to make it <br>
easier for users (especially those with not much experience using the <br>
HPC environment); so I have one login node connecting to cluster 1 and <br>
one connecting to cluster 2.<br>
<br>
I think the relevant config entries (if I'm not mistaken) on the login <br>
nodes are probably these.<br>
<br>
The differences in the slurm config files (that haven't got to do with <br>
topology & nodes & scheduler tuning) are:<br>
<br>
ClusterName=cluster1<br>
ControlMachine=cluster1-slurm<br>
ControlAddr=/IP_OF_SLURM_CONTROLLER/<br>
<br>
ClusterName=cluster2<br>
ControlMachine=cluster2-slurm<br>
ControlAddr=/IP_OF_SLURM_CONTROLLER/<br>
<br>
(where IP_OF_SLURM_CONTROLLER is the IP address of host
cluster1-slurm, <br>
same for cluster2)<br>
<br>
And then they have common entries for the AccountingStorageHost:<br>
<br>
AccountingStorageHost=slurm-db-prod<br>
AccountingStorageBackupHost=slurm-db-prod<br>
AccountingStoragePort=7030<br>
AccountingStorageType=accounting_storage/slurmdbd<br>
<br>
(slurm-db-prod is simply the hostname of the SLURM
database server)<br>
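<br>
With that in place, users on either login node can address a specific <br>
cluster (or several at once) with the -M option, along these lines <br>
(job.sh below is just a placeholder script name):<br>
<br>
sinfo -M cluster1<br>
squeue -M cluster1,cluster2<br>
sbatch -M cluster2 job.sh<br>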
<br>
Does that help?<br>
<br>
Tina<br>
<br>
On 28/10/2021 14:59, navin srivastava wrote:<br>
> Thank you Tina.<br>
> <br>
> so if I understood correctly, the database is global to both clusters and <br>
> running on the login node?<br>
> or is the database running on one of the master nodes and shared with <br>
> the other master server node?<br>
> <br>
> but as far as I have read, the slurm database can also be separate on <br>
> both masters, which just use the parameter <br>
> AccountingStorageExternalHost so that both databases are aware of each <br>
> other.<br>
> <br>
> Also, on the login node, which slurmctld does the slurm.conf file point to?<br>
> Is it possible to share a sample slurm.conf file from a login node?<br>
> <br>
> Regards<br>
> Navin.<br>
> <br>
> <br>
> <br>
> <br>
> <br>
> <br>
> <br>
> <br>
> On Thu, Oct 28, 2021 at 7:06 PM Tina Friedrich <br>
> <<a href="mailto:tina.friedrich@it.ox.ac.uk" rel="noreferrer" target="_blank">tina.friedrich@it.ox.ac.uk</a>
<mailto:<a href="mailto:tina.friedrich@it.ox.ac.uk" rel="noreferrer" target="_blank">tina.friedrich@it.ox.ac.uk</a>>>
wrote:<br>
> <br>
> Hi Navin,<br>
> <br>
> well, I have two clusters & login nodes that allow access to both. Will<br>
> that do? I don't think a third would make any difference in setup.<br>
> <br>
> They need to share a database. As long as they share a database, the<br>
> clusters have 'knowledge' of each other.<br>
> <br>
> So if you set up one database server (running
slurmdbd), and then a<br>
> SLURM controller for each cluster (running
slurmctld) using that one<br>
> central database, the '-M' option should work.<br>
> <br>
> Tina<br>
> <br>
> On 28/10/2021 10:54, navin srivastava wrote:<br>
> > Hi ,<br>
> ><br>
> > I am looking for a stepwise guide to set up a multi-cluster<br>
> > implementation.<br>
> > We wanted to set up 3 clusters and one login node to run jobs<br>
> > using the -M cluster option.<br>
> > Does anybody have such a setup, and can you share some insight into how<br>
> > it works and whether it is really a stable solution?<br>
> ><br>
> ><br>
> > Regards<br>
> > Navin.<br>
> <br>
> -- <br>
> Tina Friedrich, Advanced Research Computing Snr
HPC Systems<br>
> Administrator<br>
> <br>
> Research Computing and Support Services<br>
> IT Services, University of Oxford<br>
> <a href="http://www.arc.ox.ac.uk" rel="noreferrer noreferrer" target="_blank">http://www.arc.ox.ac.uk</a>
<<a href="http://www.arc.ox.ac.uk" rel="noreferrer
noreferrer" target="_blank">http://www.arc.ox.ac.uk</a>><br>
> <a href="http://www.it.ox.ac.uk" rel="noreferrer noreferrer" target="_blank">http://www.it.ox.ac.uk</a>
<<a href="http://www.it.ox.ac.uk" rel="noreferrer
noreferrer" target="_blank">http://www.it.ox.ac.uk</a>><br>
> <br>
<br>
-- <br>
Tina Friedrich, Advanced Research Computing Snr HPC
Systems Administrator<br>
<br>
Research Computing and Support Services<br>
IT Services, University of Oxford<br>
<a href="http://www.arc.ox.ac.uk" rel="noreferrer
noreferrer" target="_blank">http://www.arc.ox.ac.uk</a>
<a href="http://www.it.ox.ac.uk" rel="noreferrer
noreferrer" target="_blank">http://www.it.ox.ac.uk</a><br>
<br>
</blockquote>
</div>
</blockquote>
</div>
<br clear="all">
<br>
-- <br>
<div dir="ltr">
<div dir="ltr">
<div>
<pre style="font-family:monospace"> <span style="color:rgb(133,12,27)">/|</span> |
<span style="color:rgb(133,12,27)">\/</span> | <span style="color:rgb(51,88,104);font-weight:bold">Yair Yarom </span><span style="color:rgb(51,88,104)">| System Group (DevOps)</span>
<span style="color:rgb(92,181,149)">[]</span> | <span style="color:rgb(51,88,104);font-weight:bold">The Rachel and Selim Benin School</span>
<span style="color:rgb(92,181,149)">[]</span> <span style="color:rgb(133,12,27)">/\</span> | <span style="color:rgb(51,88,104);font-weight:bold">of Computer Science and Engineering</span>
<span style="color:rgb(92,181,149)">[]</span><span style="color:rgb(0,161,146)">//</span><span style="color:rgb(133,12,27)">\</span><span style="color:rgb(133,12,27)">\</span><span style="color:rgb(49,154,184)">/</span> | <span style="color:rgb(51,88,104)">The Hebrew University of Jerusalem</span>
<span style="color:rgb(92,181,149)">[</span><span style="color:rgb(1,84,76)">/</span><span style="color:rgb(0,161,146)">/</span> <span style="color:rgb(41,16,22)">\</span><span style="color:rgb(41,16,22)">\</span> | <span style="color:rgb(51,88,104)">T +972-2-5494522 | F +972-2-5494522</span>
<span style="color:rgb(1,84,76)">//</span> <span style="color:rgb(21,122,134)">\</span> | <span style="color:rgb(51,88,104)"><a href="mailto:irush@cs.huji.ac.il" target="_blank">irush@cs.huji.ac.il</a></span>
<span style="color:rgb(127,130,103)">/</span><span style="color:rgb(1,84,76)">/</span> |
</pre>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">
<div>
<pre style="font-family:monospace"> <span style="color:rgb(133,12,27)">/|</span> |
<span style="color:rgb(133,12,27)">\/</span> | <span style="color:rgb(51,88,104);font-weight:bold">Yair Yarom </span><span style="color:rgb(51,88,104)">| System Group (DevOps)</span>
<span style="color:rgb(92,181,149)">[]</span> | <span style="color:rgb(51,88,104);font-weight:bold">The Rachel and Selim Benin School</span>
<span style="color:rgb(92,181,149)">[]</span> <span style="color:rgb(133,12,27)">/\</span> | <span style="color:rgb(51,88,104);font-weight:bold">of Computer Science and Engineering</span>
<span style="color:rgb(92,181,149)">[]</span><span style="color:rgb(0,161,146)">//</span><span style="color:rgb(133,12,27)">\</span><span style="color:rgb(133,12,27)">\</span><span style="color:rgb(49,154,184)">/</span> | <span style="color:rgb(51,88,104)">The Hebrew University of Jerusalem</span>
<span style="color:rgb(92,181,149)">[</span><span style="color:rgb(1,84,76)">/</span><span style="color:rgb(0,161,146)">/</span> <span style="color:rgb(41,16,22)">\</span><span style="color:rgb(41,16,22)">\</span> | <span style="color:rgb(51,88,104)">T +972-2-5494522 | F +972-2-5494522</span>
<span style="color:rgb(1,84,76)">//</span> <span style="color:rgb(21,122,134)">\</span> | <span style="color:rgb(51,88,104)"><a href="mailto:irush@cs.huji.ac.il" target="_blank">irush@cs.huji.ac.il</a></span>
<span style="color:rgb(127,130,103)">/</span><span style="color:rgb(1,84,76)">/</span> |
</pre>
</div>
</div></div>