Hi,
As suggested,
8><---
Stop their services, start them manually one by one (ctld first), then
watch whether they talk to each other, and if they don't, learn what stops
them from doing so - then iterate editing the config, "scontrol reconfig",
lather, rinse, repeat.
8><---
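For node3 that would be something like the following (assuming the usual systemd unit names, slurmctld on the controller and slurmd on the node):

On the controller:
  systemctl stop slurmctld
  slurmctld -D -vvv      # run in the foreground with verbose logging
On node3:
  systemctl stop slurmd
  slurmd -D -vvv         # watch whether it registers with the controller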
Error logs,
On node3,
========
[2024-12-09T21:43:10.694] slurmd started on Mon, 09 Dec 2024 21:43:10 +0000
[2024-12-09T21:43:10.694] CPUs=20 Boards=1 Sockets=20 Cores=1 Threads=1 Memory=48269 TmpDisk=23308 Uptime=3703 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2024-12-09T23:38:56.645] error: Munge decode failed: Rewound credential
[2024-12-09T23:38:56.645] auth/munge: _print_cred: ENCODED: Tue Dec 10 23:38:30 2024
[2024-12-09T23:38:56.645] auth/munge: _print_cred: DECODED: Mon Dec 09 23:38:56 2024
[2024-12-09T23:38:56.645] error: Check for out of sync clocks
[2024-12-09T23:38:56.645] error: slurm_receive_msg_and_forward: g_slurm_auth_verify: REQUEST_PING has authentication error: Invalid authentication credential
[2024-12-09T23:38:56.645] error: slurm_receive_msg_and_forward: Protocol authentication error
[2024-12-09T23:38:56.655] error: service_connection: slurm_receive_msg: Protocol authentication error
[2024-12-10T01:13:18.454] Slurmd shutdown completing
======
I have checked the time and it is fine.
I ran md5sum over munge.key and get the same output on the controller and the nodes.
When I reboot the Warewulf nodes they all boot from a common container file, so 1~2 working but not 3 is just weird.
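For what it's worth, the checks I ran were roughly the following (assuming root ssh from the controller to the node):

  date; ssh node3 date                       # compare clocks
  md5sum /etc/munge/munge.key
  ssh node3 md5sum /etc/munge/munge.key      # keys match
  munge -n | ssh node3 unmunge               # end-to-end munge credential test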
On the controller,
=========
[2024-12-10T01:30:35.108] Node node2 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.109] Node node1 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.109] Node node6 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.109] Node node7 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.110] Node node3 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.110] Node node5 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.110] Node node4 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:36.104] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
==========
So the controller is RHEL 9 and the nodes are Rocky 8.
However, nodes 1~2 and 4~7 work OK.
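One thing I can still compare is the exact Slurm version on each side, e.g.:

  slurmctld -V                        # on the controller
  ssh node3 slurmd -V                 # on a compute node
  scontrol show nodes | grep Version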
When I run,
[root@vuwunicoslurmd1 log]# scontrol show node node3
NodeName=node3 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUEfctv=20 CPUTot=20 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=node3 NodeHostName=node3 Version=20.11.9
OS=Linux 4.18.0-553.30.1.el8_10.x86_64 #1 SMP Tue Nov 26 18:56:25 UTC 2024
RealMemory=48000 AllocMem=0 FreeMem=43378 Sockets=20 Boards=1
State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=2024-12-09T20:41:27 SlurmdStartTime=2024-12-10T01:13:25
LastBusyTime=2024-12-10T01:30:31
CfgTRES=cpu=20,mem=48000M,billing=20
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Node unexpectedly rebooted [slurm@2024-12-09T20:40:27]
[root@vuwunicoslurmd1 log]#
I get a reason; it's as if the node is being held down / offline. Does this need to be cleared, and can it be cleared manually?
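I am guessing the reason can be cleared from the controller with something like the command below, though I am not sure whether that addresses the underlying munge/version errors:

  scontrol update NodeName=node3 State=RESUME

(and is it ReturnToService in slurm.conf that controls whether a node that rebooted unexpectedly is returned to service automatically?)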
Regards,