[slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

Brian Andrus toomuchit at gmail.com
Mon Feb 10 19:03:29 UTC 2020


Usually that means you updated slurm.conf but have not yet run "scontrol 
reconfigure".


Brian Andrus

On 2/10/2020 8:55 AM, Robert Kudyba wrote:
> We are using Bright Cluster 8.1 and just upgraded to slurm-17.11.12.
>
> We're getting the errors below when I restart the slurmctld service. 
> The file appears to be the same on the head node and the compute nodes:
> [root@node001 ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf
>
> -rw-r--r-- 1 root root 3477 Feb 10 11:05 
> /cm/shared/apps/slurm/var/etc/slurm.conf
>
> [root@ourcluster ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf 
> /etc/slurm/slurm.conf
>
> -rw-r--r-- 1 root root 3477 Feb 10 11:05 
> /cm/shared/apps/slurm/var/etc/slurm.conf
>
> lrwxrwxrwx 1 root root   40 Nov 30  2018 /etc/slurm/slurm.conf -> 
> /cm/shared/apps/slurm/var/etc/slurm.conf
>
> So what else could be causing this?
> [2020-02-10T10:31:08.987] mcs: MCSParameters = (null). ondemand set.
> [2020-02-10T10:31:12.009] error: Node node001 appears to have a 
> different slurm.conf than the slurmctld.  This could cause issues with 
> communication and functionality.  Please review both files and make 
>  sure they are the same.  If this is expected ignore, and set 
> DebugFlags=NO_CONF_HASH in your slurm.conf.
> [2020-02-10T10:31:12.009] error: Node node001 has low real_memory size 
> (191846 < 196489092)
> [2020-02-10T10:31:12.009] error: _slurm_rpc_node_registration 
> node=node001: Invalid argument
> [2020-02-10T10:31:12.011] error: Node node002 appears to have a 
> different slurm.conf than the slurmctld.  This could cause issues with 
> communication and functionality.  Please review both files and 
> make sure they are the same.  If this is expected ignore, and set 
> DebugFlags=NO_CONF_HASH in your slurm.conf.
> [2020-02-10T10:31:12.011] error: Node node002 has low real_memory size 
> (191840 < 196489092)
> [2020-02-10T10:31:12.011] error: _slurm_rpc_node_registration 
> node=node002: Invalid argument
> [2020-02-10T10:31:12.047] error: Node node003 appears to have a 
> different slurm.conf than the slurmctld.  This could cause issues with 
> communication and functionality.  Please review both files and 
> make sure they are the same.  If this is expected ignore, and set 
> DebugFlags=NO_CONF_HASH in your slurm.conf.
> [2020-02-10T10:31:12.047] error: Node node003 has low real_memory size 
> (191840 < 196489092)
> [2020-02-10T10:31:12.047] error: Setting node node003 state to DRAIN
> [2020-02-10T10:31:12.047] drain_nodes: node node003 state set to DRAIN
> [2020-02-10T10:31:12.047] error: _slurm_rpc_node_registration 
> node=node003: Invalid argument
> [2020-02-10T10:32:08.026] 
> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
> [2020-02-10T10:56:08.988] Processing RPC: REQUEST_RECONFIGURE from uid=0
> [2020-02-10T10:56:08.992] layouts: no layout to initialize
> [2020-02-10T10:56:08.992] restoring original state of nodes
> [2020-02-10T10:56:08.992] cons_res: select_p_node_init
> [2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
> [2020-02-10T10:56:08.992] _preserve_plugins: backup_controller not 
> specified
> [2020-02-10T10:56:08.992] cons_res: select_p_reconfigure
> [2020-02-10T10:56:08.992] cons_res: select_p_node_init
> [2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
> [2020-02-10T10:56:08.992] No parameter for mcs plugin, default values set
> [2020-02-10T10:56:08.992] mcs: MCSParameters = (null). ondemand set.
> [2020-02-10T10:56:08.992] _slurm_rpc_reconfigure_controller: completed 
> usec=4369
> [2020-02-10T10:56:11.253] 
> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
> [2020-02-10T10:56:18.645] update_node: node node001 reason set to: hung
> [2020-02-10T10:56:18.645] update_node: node node001 state set to DOWN
> [2020-02-10T10:56:18.645] got (nil)
> [2020-02-10T10:56:18.679] update_node: node node001 state set to IDLE
> [2020-02-10T10:56:18.679] got (nil)
> [2020-02-10T10:56:18.693] update_node: node node002 reason set to: hung
> [2020-02-10T10:56:18.693] update_node: node node002 state set to DOWN
> [2020-02-10T10:56:18.693] got (nil)
> [2020-02-10T10:56:18.711] update_node: node node002 state set to IDLE
> [2020-02-10T10:56:18.711] got (nil)
>
> And I'm not sure if this is related, but we're also getting "Kill task 
> failed" and a node gets drained.
>
> [2020-02-09T14:42:06.006] error: slurmd error running JobId=1465 on 
> node(s)=node001: Kill task failed
> [2020-02-09T14:42:06.006] _job_complete: JobID=1465 State=0x1 
> NodeCnt=1 WEXITSTATUS 1
> [2020-02-09T14:42:06.006] email msg to ouruser at ourdomain.edu: SLURM 
> Job_id=1465 Name=run.sh Failed, Run time 00:02:23, NODE_FAIL, ExitCode 0
> [2020-02-09T14:42:06.006] _job_complete: requeue JobID=1465 
> State=0x8000 NodeCnt=1 per user/system request
> [2020-02-09T14:42:06.006] _job_complete: JobID=1465 State=0x8000 
> NodeCnt=1 done
> [2020-02-09T14:42:06.017] Requeuing JobID=1465 State=0x0 NodeCnt=0
> [2020-02-09T14:43:16.308] backfill: Started JobID=1466 in defq on node003
> [2020-02-09T14:43:17.054] prolog_running_decr: Configuration for 
> JobID=1466 is complete
> [2020-02-09T14:44:16.309] email msg to ouruser at ourdomain.edu: SLURM 
> Job_id=1461 Name=run.sh Began, Queued time 00:02:14
> [2020-02-09T14:44:16.309] backfill: Started JobID=1461 in defq on node003
> [2020-02-09T14:44:16.309] email msg to ouruser at ourdomain.edu: SLURM 
> Job_id=1465 Name=run.sh Began, Queued time 00:02:10
> [2020-02-09T14:44:16.309] backfill: Started JobID=1465 in defq on node003
> [2020-02-09T14:44:16.850] prolog_running_decr: Configuration for 
> JobID=1461 is complete
> [2020-02-09T14:44:17.040] prolog_running_decr: Configuration for 
> JobID=1465 is complete
> [2020-02-09T14:44:27.016] error: slurmd error running JobId=1466 on 
> node(s)=node003: Kill task failed
> [2020-02-09T14:44:27.016] drain_nodes: node node003 state set to DRAIN
> [2020-02-09T14:44:27.016] _job_complete: JobID=1466 State=0x1 
> NodeCnt=1 WEXITSTATUS 1
> [2020-02-09T14:44:27.016] _job_complete: requeue JobID=1466 
> State=0x8000 NodeCnt=1 per user/system request
> [2020-02-09T14:44:27.017] _job_complete: JobID=1466 State=0x8000 
> NodeCnt=1 done
> [2020-02-09T14:44:27.057] Requeuing JobID=1466 State=0x0 NodeCnt=0
> [2020-02-09T14:44:27.081] update_node: node node003 reason set to: 
> Kill task failed
> [2020-02-09T14:44:27.082] update_node: node node003 state set to DRAINING
> [2020-02-09T14:44:27.082] got (nil)
> [2020-02-09T14:45:33.098] _job_complete: JobID=1461 State=0x1 
> NodeCnt=1 WEXITSTATUS 1
> [2020-02-09T14:45:33.098] email msg to ouruser at ourdomain.edu: SLURM 
> Job_id=1461 Name=run.sh Failed, Run time 00:01:17, FAILED, ExitCode 1
>
> Could this be related to https://bugs.schedmd.com/show_bug.cgi?id=6307?
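
On the "Kill task failed" drains: slurmd sets that drain reason when it 
cannot kill a job step's processes within UnkillableStepTimeout (60 seconds 
by default). A minimal, illustrative slurm.conf tweak, assuming the tasks 
are merely slow to exit rather than truly stuck:

# slurm.conf on the controller and the nodes, then "scontrol reconfigure"
# (or restart the daemons)
# give slurmd longer to clean up job steps before draining the node
UnkillableStepTimeout=180
# optionally run a site script to gather state when a step still won't die
#UnkillableStepProgram=/path/to/unkillable_debug.sh   # hypothetical path

If the processes are hung in uninterruptible I/O (an NFS hang is a common 
culprit), raising the timeout only postpones the drain.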