<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>After you restart slurmctld, run "scontrol reconfigure".</p>
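    <p>Roughly, assuming slurmctld runs as a systemd service (adjust to
      however you manage the daemon):</p>
    <p>systemctl restart slurmctld<br>
      scontrol reconfigure   # makes the running daemons re-read slurm.conf<br>
    </p>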
    <p>Brian Andrus<br>
    </p>
    <div class="moz-cite-prefix">On 8/30/2019 6:57 AM, Robert Kudyba
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAFHi+KRb=JUqxFeYVENV5m_524JAMZ0-FF8k9n7KP43D8XOmZw@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">I had set RealMemory to a really high number as I
        mis-interpreted the recommendation.
        <div>NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092
          Sockets=2 Gres=gpu:1</div>
        <div><br>
        </div>
        <div>But now I set it to:</div>
        <div>RealMemory=191000<br>
          <div><br>
          </div>
          <div>I restarted slurmctld. And according to the Bright
            Cluster support team:</div>
          <div>"Unless it has been overridden in the image, the nodes
            will have a symlink directly to the slurm.conf on the head
            node. This means that any changes made to the file on the
            head node will automatically be available to the compute
            nodes. All they would need in that case is to have slurmd
            restarted"</div>
          <div><br>
          </div>
          <div>But now I see these errors:</div>
          <div><br>
          </div>
          <div>mcs: MCSParameters = (null). ondemand set.<br>
            [2019-08-30T09:22:41.700] error: Node node001 appears to
            have a different slurm.conf than the slurmctld.  This could
            cause issues with communication and functionality.  Please
            review both files and make sure they are the same.  If this
            is expected ignore, and set DebugFlags=NO_CONF_HASH in your
            slurm.conf.<br>
            [2019-08-30T09:22:41.700] error: Node node002 appears to
            have a different slurm.conf than the slurmctld.  This could
            cause issues with communication and functionality.  Please
            review both files and make sure they are the same.  If this
            is expected ignore, and set DebugFlags=NO_CONF_HASH in your
            slurm.conf.<br>
            [2019-08-30T09:22:41.701] error: Node node003 appears to
            have a different slurm.conf than the slurmctld.  This could
            cause issues with communication and functionality.  Please
            review both files and make sure they are the same.  If this
            is expected ignore, and set DebugFlags=NO_CONF_HASH in your
            slurm.conf.<br>
            [2019-08-30T09:23:16.347] update_node: node node001 state
            set to IDLE<br>
            [2019-08-30T09:23:16.347] got (nil)<br>
            [2019-08-30T09:23:16.766]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2<br>
            [2019-08-30T09:23:19.082] update_node: node node002 state
            set to IDLE<br>
            [2019-08-30T09:23:19.082] got (nil)<br>
            [2019-08-30T09:23:20.929] update_node: node node003 state
            set to IDLE<br>
            [2019-08-30T09:23:20.929] got (nil)<br>
            [2019-08-30T09:45:46.314] _slurm_rpc_submit_batch_job:
            JobId=449 InitPrio=4294901759 usec=355<br>
            [2019-08-30T09:45:46.430] sched: Allocate JobID=449
            NodeList=node[001-003] #CPUs=30 Partition=defq<br>
            [2019-08-30T09:45:46.670] prolog_running_decr: Configuration
            for JobID=449 is complete<br>
            [2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x1
            NodeCnt=3 WEXITSTATUS 127<br>
            [2019-08-30T09:45:46.772] _job_complete: JobID=449
            State=0x8005 NodeCnt=3 done<br>
          </div>
        </div>
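        <div>To see whether the files really differ, I suppose I can
          compare checksums of /etc/slurm/slurm.conf on the head node and
          on the nodes, e.g.:</div>
        <div>md5sum /etc/slurm/slurm.conf<br>
          pdsh -w node00[1-3] "md5sum /etc/slurm/slurm.conf"<br>
        </div>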
        <div><br>
        </div>
        <div>Is DebugFlags=NO_CONF_HASH another option that needs to be
          set?</div>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Thu, Aug 29, 2019 at 3:27
          PM Alex Chekholko <<a href="mailto:alex@calicolabs.com"
            target="_blank" moz-do-not-send="true">alex@calicolabs.com</a>>
          wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div dir="ltr">Sounds like maybe you didn't correctly roll out
            / update your slurm.conf everywhere as your RealMemory value
            is back to your large wrong number.  You need to update your
            slurm.conf everywhere and restart all the slurm daemons.
            <div><br>
            </div>
            <div>I recommend the "safe procedure" from here: <a
href="https://urldefense.proofpoint.com/v2/url?u=https-3A__wiki.fysik.dtu.dk_niflheim_SLURM-23add-2Dand-2Dremove-2Dnodes&d=DwMFaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=yUZtCS8lFs9N4Dm1nidebq1bpGa9QMJUap7ZWVR8NVg&s=Fq72zWoETitTA7ayJCyYkbp8E1fInntp4YeBv75o7vU&e="
                target="_blank" moz-do-not-send="true">https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes</a></div>
            <div>Your Bright manual may have a similar process for
              updating SLURM config "the Bright way".</div>
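            <div><br>
            </div>
            <div>Concretely, "restart all the slurm daemons" might look
              something like the following (assuming pdsh and systemd
              units; the wiki page above has the full safe procedure):</div>
            <div>pdsh -w node00[1-3] "systemctl restart slurmd"<br>
              systemctl restart slurmctld<br>
            </div>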
          </div>
          <br>
          <div class="gmail_quote">
            <div dir="ltr" class="gmail_attr">On Thu, Aug 29, 2019 at
              12:20 PM Robert Kudyba <<a
                href="mailto:rkudyba@fordham.edu" target="_blank"
                moz-do-not-send="true">rkudyba@fordham.edu</a>>
              wrote:<br>
            </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px
              0.8ex;border-left:1px solid
              rgb(204,204,204);padding-left:1ex">I thought I had taken
              care of this a while back but it appears the issue has
              returned. A very simply sbatch slurmhello.sh:<br>
               cat slurmhello.sh<br>
              #!/bin/sh<br>
              #SBATCH -o my.stdout<br>
              #SBATCH -N 3<br>
              #SBATCH --ntasks=16<br>
              module add shared openmpi/gcc/64/1.10.7 slurm<br>
              mpirun hello<br>
              <br>
              sbatch slurmhello.sh<br>
              Submitted batch job 419<br>
              <br>
              squeue<br>
                           JOBID PARTITION     NAME     USER ST     
               TIME  NODES NODELIST(REASON)<br>
                             419      defq slurmhel     root PD     
               0:00      3 (Resources)<br>
              <br>
              In /etc/slurm/slurm.conf:<br>
              # Nodes<br>
              NodeName=node[001-003]  CoresPerSocket=12
              RealMemory=196489092 Sockets=2 Gres=gpu:1<br>
              <br>
              Logs show:<br>
              [2019-08-29T14:24:40.025] error:
              _slurm_rpc_node_registration node=node001: Invalid
              argument<br>
              [2019-08-29T14:24:40.025] error: Node node002 has low
              real_memory size (191840 < 196489092)<br>
              [2019-08-29T14:24:40.025] error:
              _slurm_rpc_node_registration node=node002: Invalid
              argument<br>
              [2019-08-29T14:24:40.026] error: Node node003 has low
              real_memory size (191840 < 196489092)<br>
              [2019-08-29T14:24:40.026] error:
              _slurm_rpc_node_registration node=node003: Invalid
              argument<br>
              <br>
              scontrol show jobid -dd 419<br>
              JobId=419 JobName=slurmhello.sh<br>
                 UserId=root(0) GroupId=root(0) MCS_label=N/A<br>
                 Priority=4294901759 Nice=0 Account=root QOS=normal<br>
                 JobState=PENDING Reason=Resources Dependency=(null)<br>
                 Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0<br>
                 DerivedExitCode=0:0<br>
                 RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A<br>
                 SubmitTime=2019-08-28T09:54:22
              EligibleTime=2019-08-28T09:54:22<br>
                 StartTime=Unknown EndTime=Unknown Deadline=N/A<br>
                 PreemptTime=None SuspendTime=None SecsPreSuspend=0<br>
                 LastSchedEval=2019-08-28T09:57:22<br>
                 Partition=defq AllocNode:Sid=ourcluster:194152<br>
                 ReqNodeList=(null) ExcNodeList=(null)<br>
                 NodeList=(null)<br>
                 NumNodes=3-3 NumCPUs=16 NumTasks=16 CPUs/Task=1
              ReqB:S:C:T=0:0:*:*<br>
                 TRES=cpu=16,node=3<br>
                 Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*<br>
                 MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0<br>
                 Features=(null) DelayBoot=00:00:00<br>
                 Gres=(null) Reservation=(null)<br>
                 OverSubscribe=YES Contiguous=0 Licenses=(null)
              Network=(null)<br>
                 Command=/root/slurmhello.sh<br>
                 WorkDir=/root<br>
                 StdErr=/root/my.stdout<br>
                 StdIn=/dev/null<br>
                 StdOut=/root/my.stdout<br>
                 Power=<br>
              <br>
              scontrol show nodes node001<br>
              NodeName=node001 Arch=x86_64 CoresPerSocket=12<br>
                 CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.06<br>
                 AvailableFeatures=(null)<br>
                 ActiveFeatures=(null)<br>
                 Gres=gpu:1<br>
                 NodeAddr=node001 NodeHostName=node001 Version=17.11<br>
                 OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9
              18:05:47 UTC 2018<br>
                 RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2
              Boards=1<br>
                 State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
              Owner=N/A MCS_label=N/A<br>
                 Partitions=defq<br>
                 BootTime=2019-07-18T12:08:41
              SlurmdStartTime=2019-07-18T12:09:44<br>
                 CfgTRES=cpu=24,mem=196489092M,billing=24<br>
                 AllocTRES=<br>
                 CapWatts=n/a<br>
                 CurrentWatts=0 LowestJoules=0 ConsumedJoules=0<br>
                 ExtSensorsJoules=n/s ExtSensorsWatts=0
              ExtSensorsTemp=n/s<br>
                 Reason=Low RealMemory [slurm@2019-07-18T10:17:24]<br>
              <br>
              [root@ciscluster ~]# scontrol show nodes| grep -i mem<br>
                 RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2
              Boards=1<br>
                 CfgTRES=cpu=24,mem=196489092M,billing=24<br>
                 Reason=Low RealMemory [slurm@2019-07-18T10:17:24]<br>
                 RealMemory=196489092 AllocMem=0 FreeMem=180969
              Sockets=2 Boards=1<br>
                 CfgTRES=cpu=24,mem=196489092M,billing=24<br>
                 Reason=Low RealMemory [slurm@2019-07-18T10:17:24]<br>
                 RealMemory=196489092 AllocMem=0 FreeMem=178999
              Sockets=2 Boards=1<br>
                 CfgTRES=cpu=24,mem=196489092M,billing=24<br>
                 Reason=Low RealMemory [slurm@2019-07-18T10:17:24]<br>
              <br>
              sinfo -R<br>
              REASON               USER      TIMESTAMP         
               NODELIST<br>
              Low RealMemory       slurm     2019-07-18T10:17:24
              node[001-003]<br>
              <br>
              sinfo -N<br>
              NODELIST   NODES PARTITION STATE<br>
              node001        1     defq* drain<br>
              node002        1     defq* drain<br>
              node003        1     defq* drain<br>
              <br>
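              Since the nodes are drained, I assume that once RealMemory is
              corrected they will also need to be resumed, with something
              like:<br>
              scontrol update nodename=node[001-003] state=resume<br>
              <br>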
              pdsh -w node00[1-3]  "lscpu | grep -iE 'socket|core'"<br>
              node002: Thread(s) per core:    1<br>
              node002: Core(s) per socket:    12<br>
              node002: Socket(s):             2<br>
              node001: Thread(s) per core:    1<br>
              node001: Core(s) per socket:    12<br>
              node001: Socket(s):             2<br>
              node003: Thread(s) per core:    2<br>
              node003: Core(s) per socket:    12<br>
              node003: Socket(s):             2<br>
              <br>
              scontrol show nodes| grep -i mem<br>
                 RealMemory=196489092 AllocMem=0 FreeMem=100054
              Sockets=2 Boards=1<br>
                 CfgTRES=cpu=24,mem=196489092M,billing=24<br>
                 Reason=Low RealMemory [slurm@2019-07-18T10:17:24]<br>
                 RealMemory=196489092 AllocMem=0 FreeMem=181101
              Sockets=2 Boards=1<br>
                 CfgTRES=cpu=24,mem=196489092M,billing=24<br>
                 Reason=Low RealMemory [slurm@2019-07-18T10:17:24]<br>
                 RealMemory=196489092 AllocMem=0 FreeMem=179004
              Sockets=2 Boards=1<br>
                 CfgTRES=cpu=24,mem=196489092M,billing=24<br>
                 Reason=Low RealMemory<br>
              <br>
              Does anything look off?<br>
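              <br>
              For what it's worth, I believe "slurmd -C" prints the
              hardware a node actually detects (including RealMemory in
              megabytes) in slurm.conf format, so that should show what
              value the nodes can register:<br>
              pdsh -w node00[1-3] "slurmd -C"<br>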
            </blockquote>
          </div>
        </blockquote>
      </div>
    </blockquote>
  </body>
</html>