<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>That usually means you updated slurm.conf but have not run
      "scontrol reconfigure" yet.</p>
    <p><br>
      Brian Andrus<br>
    </p>
    <div class="moz-cite-prefix">On 2/10/2020 8:55 AM, Robert Kudyba
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAFHi+KQ6xWuzcjckWe1oV4osaLfLx9XcRbH_i9PH4tmukWM5VQ@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">
        <div>We are using Bright Cluster 8.1 and just upgraded to
          slurm-17.11.12.</div>
        <div><br>
        </div>
        We're getting the errors below when I restart the slurmctld
        service. The slurm.conf file appears to be the same on the head
        node and the compute nodes:<br>
        [root@node001 ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf<br>
        <br>
        -rw-r--r-- 1 root root 3477 Feb 10 11:05
        /cm/shared/apps/slurm/var/etc/slurm.conf<br>
        <br>
        [root@ourcluster ~]# ls -l
         /cm/shared/apps/slurm/var/etc/slurm.conf /etc/slurm/slurm.conf<br>
        <br>
        -rw-r--r-- 1 root root 3477 Feb 10 11:05
        /cm/shared/apps/slurm/var/etc/slurm.conf<br>
        <br>
        lrwxrwxrwx 1 root root   40 Nov 30  2018 /etc/slurm/slurm.conf
        -> /cm/shared/apps/slurm/var/etc/slurm.conf<br>
        <br>
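        In case a shared mount were stale on one of the nodes, a
        checksum comparison of what each side actually reads (rather
        than just ls) would be something like this, with node001 as an
        example:<br>
        <br>
        md5sum /cm/shared/apps/slurm/var/etc/slurm.conf<br>
        ssh node001 md5sum /cm/shared/apps/slurm/var/etc/slurm.conf<br>
        <br>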
        So what else could be causing this?<br>
        [2020-02-10T10:31:08.987] mcs: MCSParameters = (null). ondemand
        set.<br>
        [2020-02-10T10:31:12.009] error: Node node001 appears to have a
        different slurm.conf than the slurmctld.  This could cause
        issues with communication and functionality.  Please review both
        files and make  sure they are the same.  If this is expected
        ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.<br>
        [2020-02-10T10:31:12.009] error: Node node001 has low
        real_memory size (191846 < 196489092)<br>
        [2020-02-10T10:31:12.009] error: _slurm_rpc_node_registration
        node=node001: Invalid argument<br>
        [2020-02-10T10:31:12.011] error: Node node002 appears to have a
        different slurm.conf than the slurmctld.  This could cause
        issues with communication and functionality.  Please review both
        files and make sure they are the same.  If this is expected
        ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.<br>
        [2020-02-10T10:31:12.011] error: Node node002 has low
        real_memory size (191840 < 196489092)<br>
        [2020-02-10T10:31:12.011] error: _slurm_rpc_node_registration
        node=node002: Invalid argument<br>
        [2020-02-10T10:31:12.047] error: Node node003 appears to have a
        different slurm.conf than the slurmctld.  This could cause
        issues with communication and functionality.  Please review both
        files and make sure they are the same.  If this is expected
        ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.<br>
        [2020-02-10T10:31:12.047] error: Node node003 has low
        real_memory size (191840 < 196489092)<br>
        [2020-02-10T10:31:12.047] error: Setting node node003 state to
        DRAIN<br>
        [2020-02-10T10:31:12.047] drain_nodes: node node003 state set to
        DRAIN<br>
        [2020-02-10T10:31:12.047] error: _slurm_rpc_node_registration
        node=node003: Invalid argument<br>
        [2020-02-10T10:32:08.026]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2<br>
        [2020-02-10T10:56:08.988] Processing RPC: REQUEST_RECONFIGURE
        from uid=0<br>
        [2020-02-10T10:56:08.992] layouts: no layout to initialize<br>
        [2020-02-10T10:56:08.992] restoring original state of nodes<br>
        [2020-02-10T10:56:08.992] cons_res: select_p_node_init<br>
        [2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions<br>
        [2020-02-10T10:56:08.992] _preserve_plugins: backup_controller
        not specified<br>
        [2020-02-10T10:56:08.992] cons_res: select_p_reconfigure<br>
        [2020-02-10T10:56:08.992] cons_res: select_p_node_init<br>
        [2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions<br>
        [2020-02-10T10:56:08.992] No parameter for mcs plugin, default
        values set<br>
        [2020-02-10T10:56:08.992] mcs: MCSParameters = (null). ondemand
        set.<br>
        [2020-02-10T10:56:08.992] _slurm_rpc_reconfigure_controller:
        completed usec=4369<br>
        [2020-02-10T10:56:11.253]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2<br>
        [2020-02-10T10:56:18.645] update_node: node node001 reason set
        to: hung<br>
        [2020-02-10T10:56:18.645] update_node: node node001 state set to
        DOWN<br>
        [2020-02-10T10:56:18.645] got (nil)<br>
        [2020-02-10T10:56:18.679] update_node: node node001 state set to
        IDLE<br>
        [2020-02-10T10:56:18.679] got (nil)<br>
        [2020-02-10T10:56:18.693] update_node: node node002 reason set
        to: hung<br>
        [2020-02-10T10:56:18.693] update_node: node node002 state set to
        DOWN<br>
        [2020-02-10T10:56:18.693] got (nil)<br>
        [2020-02-10T10:56:18.711] update_node: node node002 state set to
        IDLE<br>
        [2020-02-10T10:56:18.711] got (nil)<br>
        <br>
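        Also, is the "low real_memory" error just a units mismatch? The
        nodes report roughly 191846 MB, and 196489092 divided by 1024
        is roughly 191884, so it looks as though RealMemory in
        slurm.conf may have been set from a kB figure rather than MB.
        Something like this on a compute node would show the value
        slurmd actually detects (RealMemory is in megabytes):<br>
        <br>
        # run on the compute node; prints a NodeName line with the<br>
        # detected CPUs and RealMemory in MB<br>
        slurmd -C<br>
        <br>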
        And I'm not sure if this is related, but we're also getting a
        "Kill task failed" error and a node gets drained.<br>
        <br>
        [2020-02-09T14:42:06.006] error: slurmd error running JobId=1465
        on node(s)=node001: Kill task failed<br>
        [2020-02-09T14:42:06.006] _job_complete: JobID=1465 State=0x1
        NodeCnt=1 WEXITSTATUS 1<br>
        [2020-02-09T14:42:06.006] email msg to <a
          href="mailto:ouruser@ourdomain.edu" moz-do-not-send="true">ouruser@ourdomain.edu</a>:
        SLURM Job_id=1465 Name=run.sh Failed, Run time 00:02:23,
        NODE_FAIL, ExitCode 0<br>
        [2020-02-09T14:42:06.006] _job_complete: requeue JobID=1465
        State=0x8000 NodeCnt=1 per user/system request<br>
        [2020-02-09T14:42:06.006] _job_complete: JobID=1465 State=0x8000
        NodeCnt=1 done<br>
        [2020-02-09T14:42:06.017] Requeuing JobID=1465 State=0x0
        NodeCnt=0<br>
        [2020-02-09T14:43:16.308] backfill: Started JobID=1466 in defq
        on node003<br>
        [2020-02-09T14:43:17.054] prolog_running_decr: Configuration for
        JobID=1466 is complete<br>
        [2020-02-09T14:44:16.309] email msg to <a
          href="mailto:ouruser@ourdomain.edu" moz-do-not-send="true">ouruser@ourdomain.edu</a>::
        SLURM Job_id=1461 Name=run.sh Began, Queued time 00:02:14<br>
        [2020-02-09T14:44:16.309] backfill: Started JobID=1461 in defq
        on node003<br>
        [2020-02-09T14:44:16.309] email msg to <a
          href="mailto:ouruser@ourdomain.edu" moz-do-not-send="true">ouruser@ourdomain.edu</a>::
        SLURM Job_id=1465 Name=run.sh Began, Queued time 00:02:10<br>
        [2020-02-09T14:44:16.309] backfill: Started JobID=1465 in defq
        on node003<br>
        [2020-02-09T14:44:16.850] prolog_running_decr: Configuration for
        JobID=1461 is complete<br>
        [2020-02-09T14:44:17.040] prolog_running_decr: Configuration for
        JobID=1465 is complete<br>
        [2020-02-09T14:44:27.016] error: slurmd error running JobId=1466
        on node(s)=node003: Kill task failed<br>
        [2020-02-09T14:44:27.016] drain_nodes: node node003 state set to
        DRAIN<br>
        [2020-02-09T14:44:27.016] _job_complete: JobID=1466 State=0x1
        NodeCnt=1 WEXITSTATUS 1<br>
        [2020-02-09T14:44:27.016] _job_complete: requeue JobID=1466
        State=0x8000 NodeCnt=1 per user/system request<br>
        [2020-02-09T14:44:27.017] _job_complete: JobID=1466 State=0x8000
        NodeCnt=1 done<br>
        [2020-02-09T14:44:27.057] Requeuing JobID=1466 State=0x0
        NodeCnt=0<br>
        [2020-02-09T14:44:27.081] update_node: node node003 reason set
        to: Kill task failed<br>
        [2020-02-09T14:44:27.082] update_node: node node003 state set to
        DRAINING<br>
        [2020-02-09T14:44:27.082] got (nil)<br>
        [2020-02-09T14:45:33.098] _job_complete: JobID=1461 State=0x1
        NodeCnt=1 WEXITSTATUS 1<br>
        [2020-02-09T14:45:33.098] email msg to <a
          href="mailto:ouruser@ourdomain.edu" moz-do-not-send="true">ouruser@ourdomain.edu</a>::
        SLURM Job_id=1461 Name=run.sh Failed, Run time 00:01:17, FAILED,
        ExitCode 1<br>
        <div><br>
        </div>
        <div>Could this be related to <a
            href="https://bugs.schedmd.com/show_bug.cgi?id=6307"
            moz-do-not-send="true">https://bugs.schedmd.com/show_bug.cgi?id=6307</a>?</div>
      </div>
    </blockquote>
  </body>
</html>