[slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

Robert Kudyba rkudyba at fordham.edu
Mon Jan 20 18:37:27 UTC 2020

I've posted about this twice before. I'm trying to get to the bottom of this
once and for all, and even got this reply:

> Your problem here is that the configuration for the nodes in question have
> an incorrect amount of memory set for them. Looks like you have it set in
> bytes instead of megabytes
> In your slurm.conf you should look at the RealMemory setting:
> RealMemory
> Size of real memory on the node in megabytes (e.g. "2048"). The default
> value is 1.
> I would suggest RealMemory=191879, where I suspect you have
> RealMemory=196489092
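
The unit question in the reply can be checked with a little arithmetic. The sketch below assumes the configured figure 196489092 is a kB count (the unit MemTotal uses in /proc/meminfo) rather than bytes; that is a guess, not something the reply states, and kb_to_mb is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: convert a kB count to whole MB (rounded down),
# the unit slurm.conf expects for RealMemory.
kb_to_mb() {
    echo $(( $1 / 1024 ))
}

# Assumption: 196489092 is a kB value, not a byte count.
kb_to_mb 196489092
```

Under that assumption the result is 191883 MB, which is in the same range as the value suggested in the reply.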

Now the slurmctld logs show this:

[2020-01-20T13:22:48.256] error: Node node002 has low real_memory size (191840 < 196489092)
[2020-01-20T13:22:48.256] error: Setting node node002 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node002 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2020-01-20T13:22:48.256] error: Node node001 has low real_memory size (191846 < 196489092)
[2020-01-20T13:22:48.256] error: Setting node node001 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node001 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node001: Invalid argument
[2020-01-20T13:22:48.256] error: Node node003 has low real_memory size (191840 < 196489092)
[2020-01-20T13:22:48.256] error: Setting node node003 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node003 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node003: Invalid argument
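
Each "low real_memory size" message compares what the node reported against the configured RealMemory, so the smallest reported value gives an upper bound for the setting. A rough sketch of pulling that number out, assuming each message occupies a single line in the log file itself; min_reported_mem and sample_log are hypothetical helpers, and the sample lines are copied from the log in this post.

```shell
#!/bin/sh
# Hypothetical helper: read slurmctld log lines on stdin and print the
# smallest memory size (MB) that any node reported in a
# "has low real_memory size (reported < configured)" error.
min_reported_mem() {
    sed -n 's/.*has low real_memory size (\([0-9][0-9]*\) < [0-9][0-9]*).*/\1/p' |
        sort -n | head -n 1
}

# Sample lines taken from this post, each on a single line.
sample_log() {
    cat <<'EOF'
[2020-01-20T13:22:48.256] error: Node node002 has low real_memory size (191840 < 196489092)
[2020-01-20T13:22:48.256] error: Node node001 has low real_memory size (191846 < 196489092)
[2020-01-20T13:22:48.256] error: Node node003 has low real_memory size (191840 < 196489092)
EOF
}

sample_log | min_reported_mem
```

For these three lines the smallest reported size is 191840. Going by the messages, a RealMemory at or below that figure should let all three nodes register, while 191879 would still be higher than what the nodes report.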

Here's the setting in slurm.conf:
# Nodes
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 Sockets=2
# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
Shared=NO GraceTime=0 Preempt$
PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
Shared=NO GraceTime=0 PreemptM$

sinfo -N
node001        1     defq* drain
node002        1     defq* drain
node003        1     defq* drain

[2020-01-20T12:50:51.034] error: Node node003 has low real_memory size (191840 < 196489092)
[2020-01-20T12:50:51.034] error: _slurm_rpc_node_registration node=node003: Invalid argument

pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
node001: Thread(s) per core:    1
node001: Core(s) per socket:    12
node001: Socket(s):             2
node002: Thread(s) per core:    1
node002: Core(s) per socket:    12
node002: Socket(s):             2
node003: Thread(s) per core:    2
node003: Core(s) per socket:    12
node003: Socket(s):             2

module load cmsh
[root@ciscluster kudyba]# cmsh
[ciscluster]% jobqueue
[ciscluster->jobqueue(slurm)]% ls
Type         Name                     Nodes
------------ ------------------------
Slurm        defq                     node001..node003
Slurm        gpuq

use defq
[ciscluster->jobqueue(slurm)->defq]% get options
QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP

scontrol show nodes node001
NodeName=node001 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.07
   NodeAddr=node001 NodeHostName=node001 Version=17.11
   OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
   RealMemory=196489092 AllocMem=0 FreeMem=98557 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2019-07-18T12:08:42 SlurmdStartTime=2020-01-17T21:34:15
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2020-01-20T13:22:48]

sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Low RealMemory       slurm     2020-01-20T13:22:48 node[001-003]

And the total memory in each node:
ssh node001
Last login: Mon Jan 20 13:34:00 2020
[root@node001 ~]# free -h
              total        used        free      shared  buff/cache
Mem:           187G         69G         96G        4.0G         21G
Swap:           11G         11G         55M
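
The free -h output is rounded too coarsely to pick a RealMemory value from. A sketch that reads the exact total from /proc/meminfo and converts it to MB; memtotal_mb is a hypothetical helper, and the optional file argument is only there so it can be tried on a saved copy of the file.

```shell
#!/bin/sh
# Hypothetical helper: print MemTotal in whole MB from a meminfo-style
# file (defaults to /proc/meminfo, where MemTotal is given in kB).
memtotal_mb() {
    awk '$1 == "MemTotal:" { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Only query the live file where it exists (Linux).
if [ -r /proc/meminfo ]; then
    memtotal_mb
fi
```

What slurmd itself detects may differ slightly from this figure; running slurmd -C on a node prints the configuration it would report.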

What setting is incorrect here?