Dear slurm-user list,
In the past we kept a larger buffer between RealMemory https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory and the instance memory. We then discovered that the right way is to activate the *memory option* (SelectTypeParameters=CR_Core_Memory) and set MemSpecLimit https://slurm.schedmd.com/slurm.conf.html#OPT_MemSpecLimit to reserve RAM for system processes.
However, we now run into the problem that, due to *on demand scheduling*, we have to set up the slurm.conf in advance using the RAM values of our flavors as reported by our cloud provider (OpenStack). These RAM values are higher than the RAM values the machines actually have later on:
ram_in_mib by openstack    total_ram_in_mib by top/slurm
2048                       1968
16384                      15991
32768                      32093
65536                      64297
122880                     120749
245760                     241608
491520                     483528
Given that we have to define the slurm.conf in advance, we essentially have to predict how much total RAM the instances will have once created. I used linear regression to approximate the total RAM and then lowered the result a bit to have some cushion, but this feels unsafe given that future flavors could deviate from that fit.
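For illustration, the regression approach can be sketched roughly as follows. This is a minimal sketch, assuming an ordinary least-squares fit over the seven pairs from the table above. The function names are hypothetical, and the cushion rule (subtract the largest overshoot of the fit on the known flavors) is an assumption chosen for the example, not necessarily the cushion actually used.

```python
import math

# (flavor RAM reported by OpenStack, total RAM seen by top/slurm), both in MiB,
# copied from the table above.
OBSERVED = [
    (2048, 1968),
    (16384, 15991),
    (32768, 32093),
    (65536, 64297),
    (122880, 120749),
    (245760, 241608),
    (491520, 483528),
]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def conservative_real_memory(flavor_mib, points=OBSERVED, extra_cushion_mib=0):
    """Predict total RAM for a flavor and lower it by a cushion (hypothetical rule)."""
    slope, intercept = fit_line(points)
    # Largest amount by which the fitted line overshoots any observed machine.
    worst_overshoot = max(slope * x + intercept - y for x, y in points)
    predicted = slope * flavor_mib + intercept
    return math.floor(predicted - max(worst_overshoot, 0.0) - extra_cushion_mib)


if __name__ == "__main__":
    for flavor, actual in OBSERVED:
        print(flavor, actual, conservative_real_memory(flavor))
```

By construction this never exceeds the measured value for any flavor in the table, but it gives no guarantee for a flavor outside it, which is exactly the worry described above.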
From the kernel documentation https://www.kernel.org/doc/Documentation/filesystems/proc.txt I know that MemTotal is
MemTotal: Total usable ram (i.e. physical ram minus a few reserved bits and the kernel binary code)
but given that the concrete reserved bits are quite complex https://witekio.com/blog/cat-proc-meminfo-memtotal/, I am wondering whether I am doing something wrong as this issue doesn't feel niche enough to be that complicated.
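To compare against the flavor values, MemTotal can be read on a running instance and converted to MiB. The following is a minimal sketch: the function name and the sample text are hypothetical, and rounding down to whole MiB is an assumption about how the value lines up with what Slurm reports.

```python
import re


def memtotal_mib(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, rounded down to whole MiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemTotal line found")
    # /proc/meminfo labels the unit "kB", but the value counts 1024-byte units.
    return int(match.group(1)) // 1024


# Hypothetical sample, not taken from a real machine.
SAMPLE = "MemTotal:       16374784 kB\nMemFree:         1234567 kB\n"

if __name__ == "__main__":
    print(memtotal_mib(SAMPLE))
```

On an actual node the input would be the contents of /proc/meminfo, for example `memtotal_mib(open("/proc/meminfo").read())`.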
---
Anyway, setting RealMemory in the slurm.conf above the actual total RAM (because the prediction was too high) leads to errors and to nodes being marked as invalid:
[2025-08-11T08:19:04.736] debug: Node NODE_NAME has low real_memory size (241607 / 245760) < 100.00%
[2025-08-11T08:19:04.736] error: _slurm_rpc_node_registration node=NODE_NAME: Invalid argument
or
[2025-07-03T12:57:18.486] error: Setting node NODE_NAME state to INVAL with reason:Low RealMemory (reported:64295 < 100.00% of configured:68719)
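The condition behind both messages can be written down as a small pre-flight check for planned RealMemory values. This is only a sketch, assuming the check is a plain "reported < configured" comparison as the "< 100.00%" in the log lines suggests. The function name is hypothetical and is not part of Slurm.

```python
def registration_would_fail(reported_mib, configured_mib):
    """True if the node reports less memory than slurm.conf configures for it."""
    return reported_mib < configured_mib


# (reported, configured) pairs taken from the two log excerpts above.
LOGGED_CASES = [(241607, 245760), (64295, 68719)]

if __name__ == "__main__":
    for reported, configured in LOGGED_CASES:
        print(reported, configured, registration_would_fail(reported, configured))
```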
Any hint on how to solve this is much appreciated!
Best regards, Xaver