[slurm-users] SLURM on a large shared memory node

Benjamin Redling benjamin.rampe at uni-jena.de
Thu Dec 3 12:59:28 UTC 2020


Hello Benson,

On 24/11/2020 14.20, Benson Muite wrote:
> Am setting up SLURM on a single shared memory machine. Found the 
> following blog post:
> http://rolk.github.io/2015/04/20/slurm-cluster

Sorry, but that is just a random, outdated blog post from 2015.
Even Slurm 16.05, as packaged for Debian 9 (stretch), handles cgroups
automatically -- you don't have to set them up manually.

Unless you're compiling from source, I recommend looking up which
version is packaged for your distribution. Depending on your choice,
start either with the official documentation for the current version,
20.11:
https://slurm.schedmd.com/
or with the documentation for your packaged version in the archive:
https://slurm.schedmd.com/archive/


> The main suggestion is to use cgroups to partition the resources. Are 
> there any other suggestions of changes to implement that differ from the
> standard cluster setup?

I would start with the defaults and read, read, ..., read, while trying 
to add features step by step.
I think imitating a setup you don't really understand is a really bad
idea. There will be more than enough questions, even when starting with
the basics.

Start slow. Try to look up the defaults of your packaged version. If
you're compiling from source, use the config generators from SchedMD
after reading through the basics.
You could run multiple jobs on a single node before cgroups. Try to find
the relevant sections in the official documentation to understand how
that works, its limitations, and why it might be a good thing to use
cgroups nowadays.
Again, having a vague idea that you want to "partition" the node won't 
bring you very far. IMO it's better to have at least a basic idea of 
Slurm operation.
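
As a rough illustration of what "using cgroups" amounts to in
configuration terms, here is a minimal sketch. It assumes the parameter
names documented for Slurm 20.11; check them against the documentation
for your packaged version before copying anything.

```
# slurm.conf (excerpt, illustrative only)
# Track job processes and confine tasks via cgroups.
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup

# cgroup.conf (separate file, same directory as slurm.conf)
# Confine each job to the cores it was allocated.
ConstrainCores=yes
# Confine each job to the memory it was allocated.
ConstrainRAMSpace=yes
```

Memory confinement only has something to enforce once memory is
configured as a consumable resource, which is one reason to understand
the select plugin settings before turning this on.
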
[D.C.]
What do you want next?
(The first thing I wanted in a cluster was select/cons_res with
CR_Core_Memory instead of the default select/linear. RTFM what that all
means and why that is/isn't a good idea in your case, and when or when
not to use CR_CPU_Memory. Next was understanding backfilling and its
requirement for time limits.)
Read.
Test.
Optionally ask on the list if you're having a single concrete issue.
[Da Capo al Fine]
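
The settings named in the parenthetical above could look roughly like
this in slurm.conf. This is a sketch under the assumption of Slurm 20.11
parameter names; the node name, partition name, hardware sizes and
limits are hypothetical placeholders, not recommendations.

```
# slurm.conf (excerpt, illustrative only)
# Allocate individual cores and memory instead of whole nodes.
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
# Backfill scheduling relies on job time limits to plan ahead.
SchedulerType=sched/backfill
# Default memory per allocated CPU, in MB (placeholder value).
DefMemPerCPU=2048

# Hypothetical node and partition; names and sizes are placeholders.
# `slurmd -C` prints the real hardware values for the machine it runs on.
NodeName=bigmem01 Sockets=4 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1000000
PartitionName=main Nodes=bigmem01 Default=YES DefaultTime=01:00:00 MaxTime=2-00:00:00 State=UP
```

With DefaultTime and MaxTime set on the partition, jobs submitted
without an explicit time limit still carry one, which gives the backfill
scheduler something to work with.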

(In parallel and repeatedly):
1a) The official documentation
1b) Ole Holm Nielsen's docs, starting at
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation -- even if you're 
using a Debian-based distribution, read it to get an understanding of 
the different parts a Slurm installation is made of.

(For anything beyond the basics you didn't grasp from 1a & b):
2) Blog of Chris Samuel (csamuel.org)

Reading the list for a longer time and trying to understand the topics 
that might be applicable to your setup will help a lot -- and you'll 
notice who else on the list you prefer reading / who is willing to 
answer questions you have / has similar issues that get answers you can 
learn from.

Good luck,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Redling


