We do something very similar at HMS. For instance, on our nodes with 257468 MB of RAM we round RealMemory down to 257000, and on nodes with 1031057 MB of RAM we round down to 1000000, etc.
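As a rough sketch of what the corresponding node lines look like (node names and CPU counts here are made up for illustration):

    # ~256 GB nodes: slurmd detects 257468 MB, RealMemory rounded down
    NodeName=compute[001-032] CPUs=64 RealMemory=257000
    # ~1 TB nodes: slurmd detects 1031057 MB, RealMemory rounded down
    NodeName=bigmem[01-04] CPUs=64 RealMemory=1000000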

We may tune this on our next OS and Slurm update, as I expect to see more memory used by the OS once we migrate to RHEL9.

Cheers

--
Mick Timony
Senior DevOps Engineer
LASER, Longwood, & O2 Cluster Admin
Harvard Medical School
--

From: Paul Edmon via slurm-users <slurm-users@lists.schedmd.com>
Sent: Monday, May 12, 2025 10:14 AM
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Re: Do I have to hold back RAM for worker nodes?
 

The way we typically do it here is that we look at the idle memory usage of the system by the OS and then reserve the nearest power of 2 for that. For instance right now we have 16 GB set for our MemSpecLimit. That may seem like a lot but our nodes typically have 1 TB of memory so 16 GB is not that much. The newer hardware tends to eat up more base memory, at least from my experience.
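For context, a node definition along those lines might look like the following (node name, CPU count, and RealMemory value are hypothetical; 16 GB is 16384 MB):

    # ~1 TB node; set 16 GB aside for the OS, slurmd, and other system services
    NodeName=node[001-064] CPUs=64 RealMemory=1031000 MemSpecLimit=16384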

-Paul Edmon-

On 5/12/25 8:55 AM, Xaver Stiensmeier via slurm-users wrote:

Josh,

thank you for your thorough answer. I, too, considered switching to CR_Core_Memory after reading into this. Thank you for confirming my suspicion that without Memory as a consumable resource, we cannot handle high-memory requests adequately.

If I may ask: How do you come up with the specific MemSpecLimit? Do you handpick a value for each node, use a constant value for all nodes, or take a capped percentage of the maximum memory available?

Best regards,
Xaver

On 5/12/25 14:43, Joshua Randall wrote:
Xaver,

It is my understanding that if we want to have stable systems that don't run out of memory, we do need to manage the amount of memory needed for everything not running within a slurm job, yes. 

In our cluster, we are using `CR_Core_Memory` (so we do constrain job memory) and we set `RealMemory` to the actual full amount of memory available on the machine. I believe these values really are given in megabytes (MB), not mebibytes (MiB), although the documentation's example value (e.g. "2048") could be read either way. We set `MemSpecLimit` for each node to set memory aside for everything in the system that is not running within a slurm job. This includes the slurm daemon itself, the kernel, filesystem drivers, metrics collection agents, etc. -- anything else we are running outside the control of slurm jobs. `MemSpecLimit` just sets aside the specified amount, with the result that the maximum memory jobs can use on the node is (RealMemory - MemSpecLimit). When using cgroups to limit memory, slurmd will also be allocated the specified limit so that the slurm daemon cannot encroach on job memory. However, note that `MemSpecLimit` is documented not to work unless your `SelectTypeParameters` includes Memory as a consumable resource.
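To make that concrete, here is a minimal sketch of the relevant settings (node names and all numeric values are hypothetical); jobs on such a node could then use at most 512000 - 16384 = 495616 MB:

    # slurm.conf
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory    # Memory must be a consumable resource for MemSpecLimit to take effect
    TaskPlugin=task/cgroup
    NodeName=node[01-10] CPUs=64 RealMemory=512000 MemSpecLimit=16384

    # cgroup.conf
    ConstrainCores=yes
    ConstrainRAMSpace=yes                  # enforce each job's memory allocation via cgroups

As a sanity check on the units question, running `slurmd -C` on a node prints the RealMemory value slurmd itself detects, which you can compare against what you have configured.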

Since you are using `CR_Core` (which does not configure Memory as a consumable resource), I believe your system will not be constraining job memory at all. Jobs can oversubscribe memory as many times over as there are cores, and any job would be able to run the machine out of memory by using more than is available. With this setting, I guess you could say you don't have to manage reserving memory for the OS and slurmd, but only in the sense that any job could consume all the memory and cause the system OOM killer to kill a random process (including slurmd or something else system-critical).

Cheers,

Josh.


--
Dr. Joshua C. Randall 
Director of Software Engineering, HPC

Altos Labs 



On Mon, May 12, 2025 at 10:27 AM Xaver Stiensmeier via slurm-users <slurm-users@lists.schedmd.com> wrote:

Dear Slurm-User List,

currently, in our slurm.conf, we are setting:

SelectType=select/cons_tres
SelectTypeParameters=CR_Core

and in our node configuration RealMemory was basically reduced by some amount to make sure the node always had enough RAM left to run the OS. However, this is apparently not how it is supposed to be done:

Lowering RealMemory with the goal of setting aside some amount for the OS and not available for job allocations will not work as intended if Memory is not set as a consumable resource in SelectTypeParameters. So one of the *_Memory options need to be enabled for that goal to be accomplished. (https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory)
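For illustration, the pattern we currently use looks roughly like this (node names and values are made up for this example):

    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core
    # node physically has ~64000 MB; RealMemory lowered to leave room for the OS
    NodeName=worker[01-20] CPUs=32 RealMemory=60000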

This leads to four questions regarding holding back RAM for worker nodes. Answers/help with any of those questions would be appreciated.

1. Is reserving enough RAM for the worker node's OS and slurmd actually a thing you have to manage?
2. If so, how can we reserve enough RAM for the worker node's OS and slurmd when using CR_Core?
3. Is that maybe a strong argument against using CR_Core that we overlooked?

And semi-related: https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory talks about taking a value in megabytes.

4. Is RealMemory really expecting megabytes or is it mebibytes?

Best regards,
Xaver


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com

Altos Labs UK Limited | England | Company reg 13484917  
Registered address: 3rd Floor 1 Ashley Road, Altrincham, Cheshire, United Kingdom, WA14 2DT