[slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

Robert Kudyba rkudyba at fordham.edu
Mon Jan 20 18:37:27 UTC 2020


I've posted about this previously here
<https://groups.google.com/forum/#!searchin/slurm-users/kudyba%7Csort:date/slurm-users/mMECjerUmFE/V1wK19fFAQAJ>
and here
<https://groups.google.com/forum/#!searchin/slurm-users/kudyba%7Csort:date/slurm-users/vVAyqm0wg3Y/2YoBq744AAAJ>,
so I'm trying to get to the bottom of this once and for all. I even got this
comment
<https://groups.google.com/d/msg/slurm-users/vVAyqm0wg3Y/x9-_iQQaBwAJ>
previously:

> Your problem here is that the configuration for the nodes in question has
> an incorrect amount of memory set for them. It looks like you have it set
> in bytes instead of megabytes. In your slurm.conf you should look at the
> RealMemory setting:
>
> RealMemory
> Size of real memory on the node in megabytes (e.g. "2048"). The default
> value is 1.
>
> I would suggest RealMemory=191879, where I suspect you have
> RealMemory=196489092
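That advice matches what the logs below report: slurmd detects roughly
191840-191846 MB per node, while slurm.conf sets 196489092, a bytes-scale
number. As a sketch, a corrected node line would use a value at or below the
smallest size the nodes report (191840 here is taken from the registration
errors in the log, not from a fresh measurement):

```
# RealMemory is in megabytes; keep it at or below the smallest value slurmd reports
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=191840 Sockets=2 Gres=gpu:1
```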


Now the slurmctld logs show this:

[2020-01-20T13:22:48.256] error: Node node002 has low real_memory size
(191840 < 196489092)
[2020-01-20T13:22:48.256] error: Setting node node002 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node002 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node002:
Invalid argument
[2020-01-20T13:22:48.256] error: Node node001 has low real_memory size
(191846 < 196489092)
[2020-01-20T13:22:48.256] error: Setting node node001 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node001 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node001:
Invalid argument
[2020-01-20T13:22:48.256] error: Node node003 has low real_memory size
(191840 < 196489092)
[2020-01-20T13:22:48.256] error: Setting node node003 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node003 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node003:
Invalid argument

Here's the setting in slurm.conf:
/etc/slurm/slurm.conf
# Nodes
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 Sockets=2
Gres=gpu:1
# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
Shared=NO GraceTime=0 Preempt$
PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
Shared=NO GraceTime=0 PreemptM$

sinfo -N
NODELIST   NODES PARTITION STATE
node001        1     defq* drain
node002        1     defq* drain
node003        1     defq* drain

[2020-01-20T12:50:51.034] error: Node node003 has low real_memory size
(191840 < 196489092)
[2020-01-20T12:50:51.034] error: _slurm_rpc_node_registration node=node003:
Invalid argument


pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
node001: Thread(s) per core:    1
node001: Core(s) per socket:    12
node001: Socket(s):             2
node002: Thread(s) per core:    1
node002: Core(s) per socket:    12
node002: Socket(s):             2
node003: Thread(s) per core:    2
node003: Core(s) per socket:    12
node003: Socket(s):             2

module load cmsh
[root at ciscluster kudyba]# cmsh
[ciscluster]% jobqueue
[ciscluster->jobqueue(slurm)]% ls
Type         Name                     Nodes
------------ ------------------------ ----------------------------------------------------
Slurm        defq                     node001..node003
Slurm        gpuq

use defq
[ciscluster->jobqueue(slurm)->defq]% get options
QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP

scontrol show nodes node001
NodeName=node001 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.07
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:1
   NodeAddr=node001 NodeHostName=node001 Version=17.11
   OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
   RealMemory=196489092 AllocMem=0 FreeMem=98557 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
   Partitions=defq
   BootTime=2019-07-18T12:08:42 SlurmdStartTime=2020-01-17T21:34:15
   CfgTRES=cpu=24,mem=196489092M,billing=24
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm at 2020-01-20T13:22:48]

sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Low RealMemory       slurm     2020-01-20T13:22:48 node[001-003]
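For completeness: even after the RealMemory value is corrected and slurmctld
re-reads its configuration, nodes drained by this check usually have to be
resumed by hand. A sketch of the standard sequence (scontrol reconfigure plus
a State=RESUME update; adjust the node list as needed):

```
# Re-read slurm.conf, then clear the DRAIN state set by the low-memory check
scontrol reconfigure
scontrol update NodeName=node[001-003] State=RESUME
```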

And the total memory in each node:
ssh node001
Last login: Mon Jan 20 13:34:00 2020
[root at node001 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           187G         69G         96G        4.0G         21G        112G
Swap:           11G         11G         55M
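free -h rounds to GiB, so "187G" is hard to compare against the megabyte
figure in the error message. Assuming slurmd's reported real_memory tracks
the kernel's MemTotal, the MB value can be read directly from /proc/meminfo:

```shell
# MemTotal is reported in kB; dividing by 1024 gives the MB figure
# that corresponds to slurmd's real_memory (e.g. 191840 on these nodes)
awk '/^MemTotal:/ {printf "%d\n", $2/1024}' /proc/meminfo
```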

What setting is incorrect here?