[slurm-users] "Low RealMem" after upgrade
Brian Andrus
toomuchit at gmail.com
Fri Oct 1 17:46:53 UTC 2021
Not unusual. You should set RealMemory in slurm.conf a bit below what
"slurmd -C" reports.
Kernel modules that get upgraded may use a little more memory than before,
causing just this situation. There are other causes as well, but
giving the kernel/system some wiggle room avoids most of them.
It also helps with OOM killer situations.
Brian Andrus
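
As a rough illustration of the advice above (not something from the original
reply), the sketch below reads the RealMemory value out of "slurmd -C" output
and subtracts some headroom. The function name suggest_real_memory and the
1024 MB default headroom are assumptions made for this example; pick a margin
that suits your own nodes.

```python
import re


def suggest_real_memory(slurmd_c_output: str, headroom_mb: int = 1024) -> int:
    """Return a RealMemory value (in MB) below what `slurmd -C` detected.

    headroom_mb is an arbitrary, hypothetical default; Slurm does not
    prescribe a particular margin.
    """
    # `slurmd -C` prints key=value pairs, including RealMemory=<MB>.
    match = re.search(r"\bRealMemory=(\d+)", slurmd_c_output)
    if match is None:
        raise ValueError("no RealMemory= field found in input")
    detected_mb = int(match.group(1))
    if headroom_mb < 0 or headroom_mb >= detected_mb:
        raise ValueError("headroom must be between 0 and the detected memory")
    return detected_mb - headroom_mb


# Sample input copied from the `slurmd -C` output quoted in this thread.
sample = (
    "NodeName=str957-bl0-01 CPUs=24 Boards=1 SocketsPerBoard=2 "
    "CoresPerSocket=6 ThreadsPerCore=2 RealMemory=64378\n"
    "UpTime=0-00:37:17\n"
)
print(suggest_real_memory(sample))  # 64378 - 1024 = 63354
```

The result would go into the RealMemory= field of the node's line in
slurm.conf. In the quoted message the configured value (64000) is already
below the detected 64378, so headroom alone may not account for the node
still showing IDLE+DRAIN.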
On 10/1/2021 1:22 AM, Diego Zuccato wrote:
> Hello all.
>
> I just upgraded to Debian 11, which brings Slurm 20.11, and the newer
> nodes upgraded without too many issues (just minor config changes, one
> being the RealMemory value in slurm.conf, since for some reason the
> new slurmd seems to detect about 12MB less memory than before).
>
> But the older nodes are still marked IDLE+DRAIN:
> -8<--
> NodeName=str957-bl0-01 Arch=x86_64 CoresPerSocket=6
> CPUAlloc=0 CPUTot=24 CPULoad=0.39
> AvailableFeatures=ib,blade,intel,avx
> ActiveFeatures=ib,blade,intel,avx
> Gres=(null)
> NodeAddr=str957-bl0-01 NodeHostName=str957-bl0-01 Version=20.11.4
> OS=Linux 5.10.0-8-amd64 #1 SMP Debian 5.10.46-5 (2021-09-23)
> RealMemory=64000 AllocMem=0 FreeMem=63518 Sockets=2 Boards=1
> MemSpecLimit=2048
> State=IDLE+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=2 Owner=N/A
> MCS_label=N/A
> Partitions=b1
> BootTime=2021-10-01T09:35:42 SlurmdStartTime=2021-10-01T09:36:15
> CfgTRES=cpu=24,mem=62.50G,billing=182
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Low RealMemory [root@2021-10-01T08:08:18]
> Comment=(null)
> -8<--
> I already reduced the RealMemory value in slurm.conf and restarted both
> slurmctld and slurmd (in case "scontrol reconfigure" was not enough;
> the docs are not really clear on this).
>
> The relevant lines in slurm.conf are:
> -8<--
> NodeName=DEFAULT Sockets=2 ThreadsPerCore=2 State=UNKNOWN
> MemSpecLimit=2048
> NodeName=str957-bl0-0[1-2] CoresPerSocket=6
> RealMemory=64000 Weight=2 Feature=ib,blade,intel,avx
> -8<--
>
> And the node says:
> -8<--
> root@str957-bl0-01:~# slurmd -C
> NodeName=str957-bl0-01 CPUs=24 Boards=1 SocketsPerBoard=2
> CoresPerSocket=6 ThreadsPerCore=2 RealMemory=64378
> UpTime=0-00:37:17
> -8<--
>
> I also tried lowering the RealMemory setting to 60000, in case
> MemSpecLimit interfered, but the result remains the same.
>
> Any ideas?
>
> TIA!
>