node3 not working - down - slurm-users

10 Dec 2024


      Hi,
As suggested,
8><---
Stop their services, start them manually one by one (ctld first), then
watch whether they talk to each other, and if they don't, learn what stops
them from doing so - then iterate editing the config, "scontrol reconfig",
lather, rinse, repeat.
8><---
Error logs,
On node3,
========
[2024-12-09T21:43:10.694] slurmd started on Mon, 09 Dec 2024 21:43:10 +0000
[2024-12-09T21:43:10.694] CPUs=20 Boards=1 Sockets=20 Cores=1 Threads=1 Memory=48269 TmpDisk=23308 Uptime=3703 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2024-12-09T23:38:56.645] error: Munge decode failed: Rewound credential
[2024-12-09T23:38:56.645] auth/munge: _print_cred: ENCODED: Tue Dec 10 23:38:30 2024
[2024-12-09T23:38:56.645] auth/munge: _print_cred: DECODED: Mon Dec 09 23:38:56 2024
[2024-12-09T23:38:56.645] error: Check for out of sync clocks
[2024-12-09T23:38:56.645] error: slurm_receive_msg_and_forward: g_slurm_auth_verify: REQUEST_PING has authentication error: Invalid authentication credential
[2024-12-09T23:38:56.645] error: slurm_receive_msg_and_forward: Protocol authentication error
[2024-12-09T23:38:56.655] error: service_connection: slurm_receive_msg: Protocol authentication error
[2024-12-10T01:13:18.454] Slurmd shutdown completing
======
I have checked the time, this is fine
munge.key I ran md5sum over it and get the same output.
When I reboot the warewulf nodes its a common Container file so 1~2 working but not 3 is just wierd.
On the controller,
=========
[2024-12-10T01:30:35.108] Node node2 appears to have a different version of Slurm than ours.  Please update at your earliest convenience.
[2024-12-10T01:30:35.109] Node node1 appears to have a different version of Slurm than ours.  Please update at your earliest convenience.
[2024-12-10T01:30:35.109] Node node6 appears to have a different version of Slurm than ours.  Please update at your earliest convenience.
[2024-12-10T01:30:35.109] Node node7 appears to have a different version of Slurm than ours.  Please update at your earliest convenience.
[2024-12-10T01:30:35.110] Node node3 appears to have a different version of Slurm than ours.  Please update at your earliest convenience.
[2024-12-10T01:30:35.110] Node node5 appears to have a different version of Slurm than ours.  Please update at your earliest convenience.
[2024-12-10T01:30:35.110] Node node4 appears to have a different version of Slurm than ours.  Please update at your earliest convenience.
[2024-12-10T01:30:36.104] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
==========
So the controller is RHEL9 and the nodes Rocky8
However 1~2 and 4~7 work OK.
When I run,
[root@vuwunicoslurmd1 log]# scontrol show node node3
NodeName=node3 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUEfctv=20 CPUTot=20 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node3 NodeHostName=node3 Version=20.11.9
   OS=Linux 4.18.0-553.30.1.el8_10.x86_64 #1 SMP Tue Nov 26 18:56:25 UTC 2024
   RealMemory=48000 AllocMem=0 FreeMem=43378 Sockets=20 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2024-12-09T20:41:27 SlurmdStartTime=2024-12-10T01:13:25
   LastBusyTime=2024-12-10T01:30:31
   CfgTRES=cpu=20,mem=48000M,billing=20
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Node unexpectedly rebooted [slurm@2024-12-09T20:40:27]
[root@vuwunicoslurmd1 log]#
I get a reason, its like the node is held down / off, does this need to be / can this be cleared?
regards
Steven