[slurm-users] SLURM on a large shared memory node
Benjamin Redling
benjamin.rampe at uni-jena.de
Thu Dec 3 12:59:28 UTC 2020
Hello Benson,
On 24/11/2020 14.20, Benson Muite wrote:
> Am setting up SLURM on a single shared memory machine. Found the
> following blog post:
> http://rolk.github.io/2015/04/20/slurm-cluster
Sorry, but that is just a random, outdated blog post from 2015.
Even Slurm 16.05, as packaged in Debian 9 (stretch), handles cgroups
automatically -- you don't have to set them up manually.
Unless you are compiling from source, I recommend checking which
version is packaged for your distribution. Depending on your choice,
start either with the official documentation for the current version,
20.11, at
https://slurm.schedmd.com/
or with the documentation for your packaged version in the archive at
https://slurm.schedmd.com/archive/
> The main suggestion is to use cgroups to partition the resources. Are
> there any other suggestions of changes to implement that differ from the
> standard cluster setup?
I would start with the defaults and read, read, ..., read, while trying
to add features step by step.
I think imitating a setup you don't really understand is a bad idea;
there will be more than enough questions even when starting with the
basics.
Start slow. Look up the defaults of your packaged version; if you're
compiling from source, use the configuration generators from SchedMD
after reading through the basics.
Slurm could already run multiple jobs on a single node before cgroups.
Try to find the relevant sections in the official documentation to
understand how that works, its limitations, and why it might be a good
thing to use cgroups nowadays.
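As an orientation for that reading, the cgroup-related settings usually
come down to a few lines in two files. The fragment below is only a
sketch of parameter names to look up in the slurm.conf and cgroup.conf
man pages for your version; the values are assumptions and are not
taken from any particular working setup.

```
# slurm.conf (excerpt; assumed values, check the man page of your version)
# Track job processes via cgroups
ProctrackType=proctrack/cgroup
# Confine tasks to the resources they were allocated
TaskPlugin=task/cgroup

# cgroup.conf (excerpt; assumed values)
# Restrict jobs to their allocated cores
ConstrainCores=yes
# Restrict jobs to their allocated memory
ConstrainRAMSpace=yes
```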
Again, having a vague idea that you want to "partition" the node won't
get you very far. IMO it's better to have at least a basic idea of
Slurm operation.
[D.C.]
What do you want next?
(The first thing I wanted in a cluster was select/cons_res with
CR_Core_Memory instead of the default select/linear. RTFM what all of
that means, why it is or isn't a good idea in your case, and when or
when not to use CR_CPU_Memory. Next was understanding backfilling and
its requirement for time limits.)
Read.
Test.
Optionally ask on the list if you have a single, concrete issue.
[Da Capo al Fine]
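The settings named in the aside above (select plugin, consumable
resources, backfilling, time limits) map to slurm.conf lines roughly as
follows. This is a sketch for a single node, not a recommended
configuration: the node name, CPU count, memory size, partition name,
and time limits are hypothetical placeholders.

```
# slurm.conf (excerpt); all names and numbers below are hypothetical
# Allocate individual cores and memory instead of whole nodes
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# Backfill scheduling relies on job time limits to plan ahead
SchedulerType=sched/backfill

# One shared memory node; RealMemory is given in megabytes
NodeName=bignode CPUs=64 RealMemory=512000 State=UNKNOWN

# DefaultTime applies to jobs submitted without an explicit time limit
PartitionName=main Nodes=bignode Default=YES DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP
```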
(In parallel and repeatedly):
1a) The official documentation
1b) Ole Holm Nielsen's docs, starting at
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation -- even if you're
using a Debian-based distribution, read it to get an understanding of
the different parts a Slurm installation is made of.
(For anything beyond the basics you didn't grasp from 1a & b):
2) Blog of Chris Samuel (csamuel.org)
Reading the list over a longer period and trying to understand the
topics that might be applicable to your setup will help a lot -- and
you'll notice whom on the list you prefer reading, who is willing to
answer your questions, and who has similar issues that get answers you
can learn from.
Good luck,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Redling