[slurm-users] Node node00x has low real_memory size & slurm_rpc_node_registration node=node003: Invalid argument

Robert Kudyba rkudyba at fordham.edu
Mon Jan 20 19:03:37 UTC 2020


We are on a Bright Cluster and their support says the head node controls
this. Here you can see the symlinks:

[root@node001 ~]# file /etc/slurm/slurm.conf
/etc/slurm/slurm.conf: symbolic link to
`/cm/shared/apps/slurm/var/etc/slurm.conf'

[root@ourcluster myuser]# file /etc/slurm/slurm.conf
/etc/slurm/slurm.conf: symbolic link to
`/cm/shared/apps/slurm/var/etc/slurm.conf'

 ls -l  /etc/slurm/slurm.conf
lrwxrwxrwx 1 root root 40 Nov 30  2018 /etc/slurm/slurm.conf ->
/cm/shared/apps/slurm/var/etc/slurm.conf
[root@ourcluster myuser]# ssh node001
Last login: Mon Jan 20 14:02:00 2020
[root@node001 ~]# ls -l /etc/slurm/slurm.conf
lrwxrwxrwx 1 root root 40 Nov 30  2018 /etc/slurm/slurm.conf ->
/cm/shared/apps/slurm/var/etc/slurm.conf

On Mon, Jan 20, 2020 at 1:52 PM Brian Andrus <toomuchit at gmail.com> wrote:

> Try using "nodename=node003" in the slurm.conf on your nodes.
>
> Also, make sure the slurm.conf on the nodes is the same as on the head.
>
> Somewhere in there, you have "node=node003" (as well as the other node
> names).
>
> That may even do it, as they may be trying to register generically, so
> their configs are not getting matched to the specific info in your main
> config.
>
> Brian Andrus
>
>
> On 1/20/2020 10:37 AM, Robert Kudyba wrote:
>
> I've posted about this previously here
> <https://groups.google.com/forum/#!searchin/slurm-users/kudyba%7Csort:date/slurm-users/mMECjerUmFE/V1wK19fFAQAJ>
> and here
> <https://groups.google.com/forum/#!searchin/slurm-users/kudyba%7Csort:date/slurm-users/vVAyqm0wg3Y/2YoBq744AAAJ>,
> so I'm trying to get to the bottom of this once and for all. I even got
> this comment
> <https://groups.google.com/d/msg/slurm-users/vVAyqm0wg3Y/x9-_iQQaBwAJ>
> previously:
>
>> Your problem here is that the configuration for the nodes in question
>> has an incorrect amount of memory set for them. It looks like you have
>> it set in bytes instead of megabytes.
>> In your slurm.conf you should look at the RealMemory setting:
>> RealMemory
>> Size of real memory on the node in megabytes (e.g. "2048"). The default
>> value is 1.
>> I would suggest RealMemory=191879, where I suspect you have
>> RealMemory=196489092.
>
>
> Now the slurmctld logs show this:
>
> [2020-01-20T13:22:48.256] error: Node node002 has low real_memory size
> (191840 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node002 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node002 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration
> node=node002: Invalid argument
> [2020-01-20T13:22:48.256] error: Node node001 has low real_memory size
> (191846 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node001 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node001 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration
> node=node001: Invalid argument
> [2020-01-20T13:22:48.256] error: Node node003 has low real_memory size
> (191840 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node003 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node003 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration
> node=node003: Invalid argument
>
> Here's the setting in slurm.conf:
> /etc/slurm/slurm.conf
> # Nodes
> NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 Sockets=2
> Gres=gpu:1
> # Partitions
> PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
> PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
> Shared=NO GraceTime=0 Preempt$
> PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
> PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
> Shared=NO GraceTime=0 PreemptM$
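>
> If the slurmd-reported values are right, the corrected node definition
> would presumably look something like this (191840 is the lowest of the
> figures in the log messages above, used as a conservative common value):
>
> NodeName=node[001-003]  CoresPerSocket=12 RealMemory=191840 Sockets=2
> Gres=gpu:1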
>
> sinfo -N
> NODELIST   NODES PARTITION STATE
> node001        1     defq* drain
> node002        1     defq* drain
> node003        1     defq* drain
>
>
> [2020-01-20T12:50:51.034] error: Node node003 has low real_memory size
> (191840 < 196489092)
> [2020-01-20T12:50:51.034] error: _slurm_rpc_node_registration
> node=node003: Invalid argument
>
>
> pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
> node001: Thread(s) per core:    1
> node001: Core(s) per socket:    12
> node001: Socket(s):             2
> node002: Thread(s) per core:    1
> node002: Core(s) per socket:    12
> node002: Socket(s):             2
> node003: Thread(s) per core:    2
> node003: Core(s) per socket:    12
> node003: Socket(s):             2
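>
> Note that node003 reports 2 threads per core while node001/002 report 1.
> Since the NodeName line above leaves ThreadsPerCore at its default of 1,
> a split definition along these lines might also be needed (a sketch,
> assuming hyperthreading on node003 is intentional):
>
> NodeName=node[001-002] CoresPerSocket=12 RealMemory=191840 Sockets=2 ThreadsPerCore=1 Gres=gpu:1
> NodeName=node003 CoresPerSocket=12 RealMemory=191840 Sockets=2 ThreadsPerCore=2 Gres=gpu:1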
>
> module load cmsh
> [root@ciscluster kudyba]# cmsh
> [ciscluster]% jobqueue
> [ciscluster->jobqueue(slurm)]% ls
> Type         Name                     Nodes
> ------------ ------------------------ ----------------------------------------------------
> Slurm        defq                     node001..node003
> Slurm        gpuq
>
> use defq
> [ciscluster->jobqueue(slurm)->defq]% get options
> QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP
>
> scontrol show nodes node001
> NodeName=node001 Arch=x86_64 CoresPerSocket=12
>    CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.07
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=gpu:1
>    NodeAddr=node001 NodeHostName=node001 Version=17.11
>    OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
>    RealMemory=196489092 AllocMem=0 FreeMem=98557 Sockets=2 Boards=1
>    State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
> MCS_label=N/A
>    Partitions=defq
>    BootTime=2019-07-18T12:08:42 SlurmdStartTime=2020-01-17T21:34:15
>    CfgTRES=cpu=24,mem=196489092M,billing=24
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Low RealMemory [slurm at 2020-01-20T13:22:48]
>
> sinfo -R
> REASON               USER      TIMESTAMP           NODELIST
> Low RealMemory       slurm     2020-01-20T13:22:48 node[001-003]
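>
> Once RealMemory is corrected, re-reading the config and undraining the
> nodes should presumably go something like this (scontrol reconfigure may
> not pick up node definition changes; a slurmctld restart may be needed):
>
> scontrol reconfigure
> scontrol update NodeName=node[001-003] State=RESUME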
>
> And the total memory in each node:
> ssh node001
> Last login: Mon Jan 20 13:34:00 2020
> [root@node001 ~]# free -h
>               total        used        free      shared  buff/cache   available
> Mem:           187G         69G         96G        4.0G         21G        112G
> Swap:           11G         11G         55M
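>
> For what it's worth, 196489092 is exactly what a ~187G node's MemTotal
> looks like in kilobytes (196489092 kB / 1024 ≈ 191884 MB, right next to
> the 191840-191846 MB the nodes register with), so it appears a kilobyte
> figure like the one below was pasted into a field Slurm reads as
> megabytes (illustrative output, not captured verbatim):
>
> [root@node001 ~]# grep MemTotal /proc/meminfo
> MemTotal:       196489092 kB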
>
> What setting is incorrect here?
>
>