<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    Dear Robert,<br>
    <br>
    On 1/20/20 7:37 PM, Robert Kudyba wrote:<br>
    <blockquote type="cite"
cite="mid:CAFHi+KRhntNve=SPMCjWRS9LG=bza9uEyxzthBtkmA9dUeEATw@mail.gmail.com">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <div dir="ltr">I've posted about this previously <a
href="https://groups.google.com/forum/#!searchin/slurm-users/kudyba%7Csort:date/slurm-users/mMECjerUmFE/V1wK19fFAQAJ"
          moz-do-not-send="true">here</a>, and <a
href="https://groups.google.com/forum/#!searchin/slurm-users/kudyba%7Csort:date/slurm-users/vVAyqm0wg3Y/2YoBq744AAAJ"
          moz-do-not-send="true">here</a> so I'm trying to get to the
        bottom of this once and for all and even got <a
href="https://groups.google.com/d/msg/slurm-users/vVAyqm0wg3Y/x9-_iQQaBwAJ"
          moz-do-not-send="true">this comment</a> previously:
        <div><br>
        </div>
        <div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px
            0.8ex;border-left:1px solid
            rgb(204,204,204);padding-left:1ex">our problem here is that
            the configuration for the nodes in question have an
            incorrect amount of memory set for them. Looks like you have
            it set in bytes instead of megabytes<br>
            In your slurm.conf you should look at the RealMemory
            setting:<br>
            RealMemory<br>
            Size of real memory on the node in megabytes (e.g. "2048").
            The default value is 1.<br>
            I would suggest RealMemory=191879 , where I suspect you have
            RealMemory=196489092</blockquote>
          <br>
        </div>
      </div>
    </blockquote>
Are you sure your 24-core nodes have 187 TERABYTES of memory?<br>
    <br>
    As you yourself cited:<br>
    <blockquote type="cite">Size of real memory on the node in megabytes</blockquote>
    The settings in your slurm.conf:<br>
    <blockquote type="cite">NodeName=node[001-003]  CoresPerSocket=12
      RealMemory=196489092 Sockets=2 Gres=gpu:1<br>
    </blockquote>
So, according to that setting, your machines should have 196489092 megabytes of memory, which is
    ~191884 gigabytes or ~187 terabytes.<br>
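For reference, the conversion:<br>
    <font face="monospace">196489092 MB / 1024 ~ 191884 GB<br>
      191884 GB / 1024 ~ 187 TB</font><br>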
    <br>
    Slurm believes these machines do NOT have that much memory:<br>
    <blockquote type="cite"><font face="monospace">[2020-01-20T13:22:48.256]
        error: Node node002 has low real_memory size (191840 <
        196489092)<br>
      </font></blockquote>
    It sees only 191840 megabytes (191846 on node001), which is even less than
    the 191879 Brian suggested. Since the available memory changes slightly from
    OS version to OS version, I would suggest setting RealMemory to something
    below 191840, e.g. 191800.<br>
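    Keeping your other settings unchanged, the corrected line in slurm.conf
    could then look like this (191800 is only an example value below the
    smallest memory Slurm actually detects on your nodes):<br>
    <font face="monospace">NodeName=node[001-003] CoresPerSocket=12 RealMemory=191800 Sockets=2 Gres=gpu:1</font><br>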
    But Brian already told you to reduce RealMemory:<br>
    <blockquote type="cite">I would suggest RealMemory=191879 , where I
      suspect you have RealMemory=196489092</blockquote>
    <br>
    If Slurm detects less memory on a node than its configured RealMemory, it
    drains the node, because a hardware problem such as a defective DIMM is
    assumed.<br>
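    <br>
    To see what Slurm itself detects on a node, you can run "slurmd -C" there;
    it prints the hardware configuration in slurm.conf syntax, including
    RealMemory. After correcting slurm.conf and restarting the daemons, the
    drained nodes still have to be resumed by hand, for example:<br>
    <font face="monospace">pdsh -w node00[1-3] "slurmd -C | grep RealMemory"<br>
      scontrol update NodeName=node[001-003] State=RESUME</font><br>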
    <br>
    Best<br>
    Marcus<br>
    <br>
    <br>
    <blockquote type="cite"
cite="mid:CAFHi+KRhntNve=SPMCjWRS9LG=bza9uEyxzthBtkmA9dUeEATw@mail.gmail.com">
      <div dir="ltr">
        <div>Now the slurmctld logs show this:</div>
        <div><br>
          <font face="monospace">[2020-01-20T13:22:48.256] error: Node
            node002 has low real_memory size (191840 < 196489092)<br>
            [2020-01-20T13:22:48.256] error: Setting node node002 state
            to DRAIN<br>
            [2020-01-20T13:22:48.256] drain_nodes: node node002 state
            set to DRAIN<br>
            [2020-01-20T13:22:48.256] error:
            _slurm_rpc_node_registration node=node002: Invalid argument<br>
            [2020-01-20T13:22:48.256] error: Node node001 has low
            real_memory size (191846 < 196489092)<br>
            [2020-01-20T13:22:48.256] error: Setting node node001 state
            to DRAIN<br>
            [2020-01-20T13:22:48.256] drain_nodes: node node001 state
            set to DRAIN<br>
            [2020-01-20T13:22:48.256] error:
            _slurm_rpc_node_registration node=node001: Invalid argument<br>
            [2020-01-20T13:22:48.256] error: Node node003 has low
            real_memory size (191840 < 196489092)<br>
            [2020-01-20T13:22:48.256] error: Setting node node003 state
            to DRAIN<br>
            [2020-01-20T13:22:48.256] drain_nodes: node node003 state
            set to DRAIN<br>
            [2020-01-20T13:22:48.256] error:
            _slurm_rpc_node_registration node=node003: Invalid argument</font><br>
        </div>
        <div><br>
        </div>
        <div>Here's the setting in slurm.conf:</div>
        <div>/etc/slurm/slurm.conf<br>
          # Nodes<br>
          NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092
          Sockets=2 Gres=gpu:1<br>
          # Partitions<br>
          PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
          PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO
          RootOnly=NO Hidden=NO Shared=NO GraceTime=0 Preempt$<br>
          PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL
          PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO
          RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptM$<br>
        </div>
        <div><br>
        </div>
        <div>sinfo -N<br>
          NODELIST   NODES PARTITION STATE<br>
          node001        1     defq* drain<br>
          node002        1     defq* drain<br>
          node003        1     defq* drain<br>
          <br>
          [2020-01-20T12:50:51.034] error: Node node003 has low
          real_memory size (191840 < 196489092)<br>
          [2020-01-20T12:50:51.034] error: _slurm_rpc_node_registration
          node=node003: Invalid argument<br>
          <br>
          pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"<br>
          node001: Thread(s) per core:    1<br>
          node001: Core(s) per socket:    12<br>
          node001: Socket(s):             2<br>
          node002: Thread(s) per core:    1<br>
          node002: Core(s) per socket:    12<br>
          node002: Socket(s):             2<br>
          node003: Thread(s) per core:    2<br>
          node003: Core(s) per socket:    12<br>
          node003: Socket(s):             2<br>
          <br>
          module load cmsh<br>
          [root@ciscluster kudyba]# cmsh<br>
          [ciscluster]% jobqueue<br>
          [ciscluster->jobqueue(slurm)]% ls<br>
          Type         Name                     Nodes<br>
          ------------ ------------------------
          ----------------------------------------------------<br>
          Slurm        defq                     node001..node003<br>
          Slurm        gpuq<br>
          <br>
          use defq<br>
          [ciscluster->jobqueue(slurm)->defq]% get options<br>
          QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12
          OverTimeLimit=0 State=UP<br>
          <br>
          scontrol show nodes node001<br>
          NodeName=node001 Arch=x86_64 CoresPerSocket=12<br>
             CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.07<br>
             AvailableFeatures=(null)<br>
             ActiveFeatures=(null)<br>
             Gres=gpu:1<br>
             NodeAddr=node001 NodeHostName=node001 Version=17.11<br>
             OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9
          18:05:47 UTC 2018<br>
             RealMemory=196489092 AllocMem=0 FreeMem=98557 Sockets=2
          Boards=1<br>
             State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
          Owner=N/A MCS_label=N/A<br>
             Partitions=defq<br>
             BootTime=2019-07-18T12:08:42
          SlurmdStartTime=2020-01-17T21:34:15<br>
             CfgTRES=cpu=24,mem=196489092M,billing=24<br>
             AllocTRES=<br>
             CapWatts=n/a<br>
             CurrentWatts=0 LowestJoules=0 ConsumedJoules=0<br>
             ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s<br>
             Reason=Low RealMemory [slurm@2020-01-20T13:22:48]<br>
          <br>
          sinfo -R<br>
          REASON               USER      TIMESTAMP           NODELIST<br>
          Low RealMemory       slurm     2020-01-20T13:22:48
          node[001-003]<br>
        </div>
        <div><br>
        </div>
        <div>And the total memory in each node:</div>
        <div>ssh node001<br>
          Last login: Mon Jan 20 13:34:00 2020<br>
          [root@node001 ~]# free -h<br>
                         total        used        free      shared  buff/cache   available<br>
          Mem:           187G         69G         96G        4.0G         21G        112G<br>
          Swap:           11G         11G         55M<br>
        </div>
        <div><br>
        </div>
        <div>What setting is incorrect here?</div>
      </div>
    </blockquote>
    <br>
    <pre class="moz-signature" cols="72">-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
<a class="moz-txt-link-abbreviated" href="mailto:wagner@itc.rwth-aachen.de">wagner@itc.rwth-aachen.de</a>
<a class="moz-txt-link-abbreviated" href="http://www.itc.rwth-aachen.de">www.itc.rwth-aachen.de</a>
</pre>
  </body>
</html>