[slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument
Brian Andrus
toomuchit at gmail.com
Mon Jan 20 18:51:56 UTC 2020
Try using "NodeName=node003" in the slurm.conf on your nodes.
Also, make sure the slurm.conf on the nodes is identical to the one on
the head node.
Somewhere in there you have "node=node003" (and likewise for the other
nodes' names).
That alone may fix it: the nodes may be trying to register
generically, so their configs are not being matched against the
specific entries in your main config.
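A quick way to get a known-good line is to ask slurmd itself what it
detects on each node and paste that into slurm.conf. The output below
is a sketch, so your exact fields and values may differ:

slurmd -C
# prints something like:
# NodeName=node003 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191840

Note that RealMemory there is in megabytes, which is what slurm.conf
expects.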
Brian Andrus
On 1/20/2020 10:37 AM, Robert Kudyba wrote:
> I've posted about this previously here
> <https://groups.google.com/forum/#!searchin/slurm-users/kudyba%7Csort:date/slurm-users/mMECjerUmFE/V1wK19fFAQAJ>
> and here
> <https://groups.google.com/forum/#!searchin/slurm-users/kudyba%7Csort:date/slurm-users/vVAyqm0wg3Y/2YoBq744AAAJ>,
> so I'm trying to get to the bottom of this once and for all. I even
> got this comment
> <https://groups.google.com/d/msg/slurm-users/vVAyqm0wg3Y/x9-_iQQaBwAJ>
> previously:
>
> Your problem here is that the configuration for the nodes in
> question has an incorrect amount of memory set for them. It looks
> like you have it set in kilobytes instead of megabytes.
> In your slurm.conf you should look at the RealMemory setting:
> RealMemory
> Size of real memory on the node in megabytes (e.g. "2048"). The
> default value is 1.
> I would suggest RealMemory=191879, where I suspect you have
> RealMemory=196489092.
>
>
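> For what it's worth, 196489092 looks like MemTotal from
> /proc/meminfo, which is reported in kB; converting kB to MB lands
> right next to the suggested value, while the nodes themselves
> register only ~191840 MB:
>
> $ echo $((196489092 / 1024))
> 191883
>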
> Now the slurmctld logs show this:
>
> [2020-01-20T13:22:48.256] error: Node node002 has low real_memory size
> (191840 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node002 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node002 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration
> node=node002: Invalid argument
> [2020-01-20T13:22:48.256] error: Node node001 has low real_memory size
> (191846 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node001 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node001 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration
> node=node001: Invalid argument
> [2020-01-20T13:22:48.256] error: Node node003 has low real_memory size
> (191840 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node003 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node003 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration
> node=node003: Invalid argument
>
> Here's the setting in slurm.conf:
> /etc/slurm/slurm.conf
> # Nodes
> NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092
> Sockets=2 Gres=gpu:1
> # Partitions
> PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
> PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO
> Hidden=NO Shared=NO GraceTime=0 Preempt$
> PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
> PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO
> Hidden=NO Shared=NO GraceTime=0 PreemptM$
>
> sinfo -N
> NODELIST NODES PARTITION STATE
> node001 1 defq* drain
> node002 1 defq* drain
> node003 1 defq* drain
>
> [2020-01-20T12:50:51.034] error: Node node003 has low real_memory size
> (191840 < 196489092)
> [2020-01-20T12:50:51.034] error: _slurm_rpc_node_registration
> node=node003: Invalid argument
>
> pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
> node001: Thread(s) per core: 1
> node001: Core(s) per socket: 12
> node001: Socket(s): 2
> node002: Thread(s) per core: 1
> node002: Core(s) per socket: 12
> node002: Socket(s): 2
> node003: Thread(s) per core: 2
> node003: Core(s) per socket: 12
> node003: Socket(s): 2
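>
> As an aside, node003 reports ThreadsPerCore=2 while node001/node002
> report 1. If hyperthreading is intentionally enabled on node003, it
> may need its own NodeName line; a hypothetical sketch using the
> ~191840 MB the nodes actually report:
>
> NodeName=node[001-002] CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191840 Sockets=2 Gres=gpu:1
> NodeName=node003 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=191840 Sockets=2 Gres=gpu:1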
>
> module load cmsh
> [root@ciscluster kudyba]# cmsh
> [ciscluster]% jobqueue
> [ciscluster->jobqueue(slurm)]% ls
> Type         Name                     Nodes
> ------------ ------------------------ ----------------------------------------------------
> Slurm        defq                     node001..node003
> Slurm        gpuq
>
> use defq
> [ciscluster->jobqueue(slurm)->defq]% get options
> QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP
>
> scontrol show nodes node001
> NodeName=node001 Arch=x86_64 CoresPerSocket=12
> CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.07
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=gpu:1
> NodeAddr=node001 NodeHostName=node001 Version=17.11
> OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
> RealMemory=196489092 AllocMem=0 FreeMem=98557 Sockets=2 Boards=1
> State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
> MCS_label=N/A
> Partitions=defq
> BootTime=2019-07-18T12:08:42 SlurmdStartTime=2020-01-17T21:34:15
> CfgTRES=cpu=24,mem=196489092M,billing=24
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Low RealMemory [slurm@2020-01-20T13:22:48]
>
> sinfo -R
> REASON USER TIMESTAMP NODELIST
> Low RealMemory slurm 2020-01-20T13:22:48 node[001-003]
>
> And the total memory in each node:
> ssh node001
> Last login: Mon Jan 20 13:34:00 2020
> [root@node001 ~]# free -h
>               total        used        free      shared  buff/cache   available
> Mem:           187G         69G         96G        4.0G         21G        112G
> Swap:           11G         11G         55M
>
> What setting is incorrect here?
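
One more thing: once RealMemory is corrected, the nodes will stay in
DRAIN until you clear them. Something along these lines should do it
(a sketch; adjust node names as needed):

scontrol reconfigure
scontrol update NodeName=node[001-003] State=RESUME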