[slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument
Marcus Wagner
wagner at itc.rwth-aachen.de
Tue Jan 21 07:51:25 UTC 2020
Dear Robert,
On 1/20/20 7:37 PM, Robert Kudyba wrote:
> I've posted about this previously here
> <https://groups.google.com/forum/#!searchin/slurm-users/kudyba%7Csort:date/slurm-users/mMECjerUmFE/V1wK19fFAQAJ>,
> and here
> <https://groups.google.com/forum/#!searchin/slurm-users/kudyba%7Csort:date/slurm-users/vVAyqm0wg3Y/2YoBq744AAAJ> so
> I'm trying to get to the bottom of this once and for all and even got
> this comment
> <https://groups.google.com/d/msg/slurm-users/vVAyqm0wg3Y/x9-_iQQaBwAJ>
> previously:
>
> our problem here is that the configuration for the nodes in
> question have an incorrect amount of memory set for them. Looks
> like you have it set in bytes instead of megabytes
> In your slurm.conf you should look at the RealMemory setting:
> RealMemory
> Size of real memory on the node in megabytes (e.g. "2048"). The
> default value is 1.
> I would suggest RealMemory=191879 , where I suspect you have
> RealMemory=196489092
>
>
Are you sure your 24-core nodes have 187 TERABYTES of memory?
As you yourself cited:
> Size of real memory on the node in megabytes
The settings in your slurm.conf:
> NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092
> Sockets=2 Gres=gpu:1
So, according to your configuration, your machines should have 196489092 megabytes of memory, which is
~191884 gigabytes, or ~187 terabytes.
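A quick sanity check of the arithmetic (plain shell, nothing Slurm-specific):

    $ echo $(( 196489092 / 1024 / 1024 ))   # configured MB expressed as (binary) TB
    187

So taken literally, the configured value really is ~187 TB.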
Slurm can see that these machines do NOT have that much memory:
> [2020-01-20T13:22:48.256] error: Node node002 has low real_memory size
> (191840 < 196489092)
Slurmd sees only 191840 megabytes, which is still less than 191884 (your configured value divided by 1024), so even treating the configured number as kilobytes would leave RealMemory slightly too high.
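If you want to see exactly what slurmd detects, you can run it with -C on one of the nodes; it prints the hardware as a slurm.conf line, including RealMemory. The output below is only what I would expect given your "free" output, not taken from your nodes:

    [root at node001 ~]# slurmd -C
    NodeName=node001 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191846
    UpTime=...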
Since the available memory changes slightly from OS version to OS
version, I would suggest setting RealMemory to a bit less than 191840, e.g. 191800.
But Brian already told you to reduce the RealMemory:
> I would suggest RealMemory=191879 , where I suspect you have
> RealMemory=196489092
If Slurm detects less memory on a node than its configured RealMemory, it drains the node,
because a defective DIMM is assumed.
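As a sketch, the node line would then look like this (untested; if Bright/cmsh manages slurm.conf for you, make the change there so it does not get overwritten):

    NodeName=node[001-003] CoresPerSocket=12 RealMemory=191800 Sockets=2 Gres=gpu:1

After distributing the changed slurm.conf, restart slurmctld on the head node and slurmd on the nodes (a plain "scontrol reconfigure" is not always enough for node definition changes), then resume the drained nodes:

    scontrol update NodeName=node[001-003] State=RESUME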
Best
Marcus
> Now the slurmctld logs show this:
>
> [2020-01-20T13:22:48.256] error: Node node002 has low real_memory size
> (191840 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node002 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node002 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration
> node=node002: Invalid argument
> [2020-01-20T13:22:48.256] error: Node node001 has low real_memory size
> (191846 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node001 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node001 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration
> node=node001: Invalid argument
> [2020-01-20T13:22:48.256] error: Node node003 has low real_memory size
> (191840 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node003 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node003 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration
> node=node003: Invalid argument
>
> Here's the setting in slurm.conf:
> /etc/slurm/slurm.conf
> # Nodes
> NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092
> Sockets=2 Gres=gpu:1
> # Partitions
> PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
> PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO
> Hidden=NO Shared=NO GraceTime=0 Preempt$
> PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
> PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO
> Hidden=NO Shared=NO GraceTime=0 PreemptM$
>
> sinfo -N
> NODELIST NODES PARTITION STATE
> node001 1 defq* drain
> node002 1 defq* drain
> node003 1 defq* drain
>
> sinfo -N
> NODELIST NODES PARTITION STATE
> node001 1 defq* drain
> node002 1 defq* drain
> node003 1 defq* drain
>
> [2020-01-20T12:50:51.034] error: Node node003 has low real_memory size
> (191840 < 196489092)
> [2020-01-20T12:50:51.034] error: _slurm_rpc_node_registration
> node=node003: Invalid argument
>
> /etc/slurm/slurm.conf
> # Nodes
> NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092
> Sockets=2 Gres=gpu:1
> # Partitions
> PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
> PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO
> Hidden=NO Shared=NO GraceTime=0 Preempt$
> PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
> PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO
> Hidden=NO Shared=NO GraceTime=0 PreemptM$
>
> pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
> node001: Thread(s) per core: 1
> node001: Core(s) per socket: 12
> node001: Socket(s): 2
> node002: Thread(s) per core: 1
> node002: Core(s) per socket: 12
> node002: Socket(s): 2
> node003: Thread(s) per core: 2
> node003: Core(s) per socket: 12
> node003: Socket(s): 2
>
> module load cmsh
> [root at ciscluster kudyba]# cmsh
> [ciscluster]% jobqueue
> [ciscluster->jobqueue(slurm)]% ls
> Type         Name                     Nodes
> ------------ ------------------------ ----------------------------------------------------
> Slurm        defq                     node001..node003
> Slurm        gpuq
>
> use defq
> [ciscluster->jobqueue(slurm)->defq]% get options
> QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP
>
> scontrol show nodes node001
> NodeName=node001 Arch=x86_64 CoresPerSocket=12
> CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.07
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=gpu:1
> NodeAddr=node001 NodeHostName=node001 Version=17.11
> OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
> RealMemory=196489092 AllocMem=0 FreeMem=98557 Sockets=2 Boards=1
> State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
> MCS_label=N/A
> Partitions=defq
> BootTime=2019-07-18T12:08:42 SlurmdStartTime=2020-01-17T21:34:15
> CfgTRES=cpu=24,mem=196489092M,billing=24
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Low RealMemory [slurm at 2020-01-20T13:22:48]
>
> sinfo -R
> REASON USER TIMESTAMP NODELIST
> Low RealMemory slurm 2020-01-20T13:22:48 node[001-003]
>
> And the total memory in each node:
> ssh node001
> Last login: Mon Jan 20 13:34:00 2020
> [root at node001 ~]# free -h
>               total        used        free      shared  buff/cache   available
> Mem:           187G         69G         96G        4.0G         21G        112G
> Swap:           11G         11G         55M
>
> What setting is incorrect here?
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de