Hi,
As suggested,
8><--- Stop their services, start them manually one by one (ctld first), then watch whether they talk to each other, and if they don't, learn what stops them from doing so - then iterate editing the config, "scontrol reconfig", lather, rinse, repeat. 8><---
Error logs,
On node3,
========
[2024-12-09T21:43:10.694] slurmd started on Mon, 09 Dec 2024 21:43:10 +0000
[2024-12-09T21:43:10.694] CPUs=20 Boards=1 Sockets=20 Cores=1 Threads=1 Memory=48269 TmpDisk=23308 Uptime=3703 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2024-12-09T23:38:56.645] error: Munge decode failed: Rewound credential
[2024-12-09T23:38:56.645] auth/munge: _print_cred: ENCODED: Tue Dec 10 23:38:30 2024
[2024-12-09T23:38:56.645] auth/munge: _print_cred: DECODED: Mon Dec 09 23:38:56 2024
[2024-12-09T23:38:56.645] error: Check for out of sync clocks
[2024-12-09T23:38:56.645] error: slurm_receive_msg_and_forward: g_slurm_auth_verify: REQUEST_PING has authentication error: Invalid authentication credential
[2024-12-09T23:38:56.645] error: slurm_receive_msg_and_forward: Protocol authentication error
[2024-12-09T23:38:56.655] error: service_connection: slurm_receive_msg: Protocol authentication error
[2024-12-10T01:13:18.454] Slurmd shutdown completing
========
I have checked the time and it is fine.
I ran md5sum over munge.key and get the same output on every node.
The warewulf nodes all boot from a common container image, so nodes 1 and 2 working but not node3 is just weird.
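For what it's worth, a minimal sketch of the check I ran (the `same_key` helper and node names are mine, not from any Slurm tooling). Note that a matching key alone isn't sufficient: munge also embeds an encode timestamp in each credential and validates it against the decoder's clock (in UTC), which is what the "Rewound credential" error is complaining about.

```shell
# Hypothetical helper: success if two munge.key copies hash identically.
same_key() {
  [ "$(md5sum "$1" | awk '{print $1}')" = "$(md5sum "$2" | awk '{print $1}')" ]
}

# Typical use (placeholder node name, assumes root ssh works):
#   scp node3:/etc/munge/munge.key /tmp/node3.key
#   same_key /etc/munge/munge.key /tmp/node3.key || echo "munge.key mismatch on node3"
#   ssh node3 date -u +%s; date -u +%s   # compare UTC epoch seconds too
```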
On the controller,
=========
[2024-12-10T01:30:35.108] Node node2 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.109] Node node1 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.109] Node node6 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.109] Node node7 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.110] Node node3 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.110] Node node5 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.110] Node node4 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:36.104] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
=========
The controller is RHEL9 and the nodes are Rocky8; however, nodes 1, 2 and 4 through 7 work OK.
When I run,
[root@vuwunicoslurmd1 log]# scontrol show node node3
NodeName=node3 Arch=x86_64 CoresPerSocket=1 CPUAlloc=0 CPUEfctv=20 CPUTot=20 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node3 NodeHostName=node3 Version=20.11.9
   OS=Linux 4.18.0-553.30.1.el8_10.x86_64 #1 SMP Tue Nov 26 18:56:25 UTC 2024
   RealMemory=48000 AllocMem=0 FreeMem=43378 Sockets=20 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2024-12-09T20:41:27 SlurmdStartTime=2024-12-10T01:13:25
   LastBusyTime=2024-12-10T01:30:31
   CfgTRES=cpu=20,mem=48000M,billing=20
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Node unexpectedly rebooted [slurm@2024-12-09T20:40:27]
[root@vuwunicoslurmd1 log]#
I get a reason; it looks like the node is being held down/off. Does this need to be cleared, and if so, can it be?
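Yes, once the underlying clock issue is fixed, a DOWN state with a stale Reason can be cleared by resuming the node with scontrol. A sketch (built as a string and guarded only so it is harmless on a machine without Slurm; the node name comes from the output above):

```shell
# Clear the DOWN state set by "Node unexpectedly rebooted".
node=node3
cmd="scontrol update NodeName=$node State=RESUME"
if command -v scontrol >/dev/null 2>&1; then
  $cmd
  sinfo -n "$node" -o "%N %T %E"   # confirm state, and that Reason is gone
else
  echo "scontrol not found; would run: $cmd"
fi
```

If nodes routinely come back this way after reboots, `ReturnToService=1` in slurm.conf tells slurmctld to return a DOWN node to service automatically when a healthy slurmd registers.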
regards
Steven
On 9/12/24 5:44 pm, Steven Jones via slurm-users wrote:
[2024-12-09T23:38:56.645] error: Munge decode failed: Rewound credential
[2024-12-09T23:38:56.645] auth/munge: _print_cred: ENCODED: Tue Dec 10 23:38:30 2024
[2024-12-09T23:38:56.645] auth/munge: _print_cred: DECODED: Mon Dec 09 23:38:56 2024
[2024-12-09T23:38:56.645] error: Check for out of sync clocks
One system is 24 hours behind/ahead of the other.
You should make sure NTP is set up and working on all these nodes.
Thanks,
The times were correct via chrony, but the timezones were UTC and NZDT, which was the issue. Oddly, nodes 1 and 2 didn't care about that, only node3. ***shrug***
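For anyone hitting the same thing, the checks that expose this kind of skew (guarded so the sketch runs anywhere; munge compares timestamps in UTC, so if a timezone mismatch breaks it, the underlying clock was usually set from local time somewhere along the way):

```shell
# Compare each node's idea of UTC and its timezone configuration.
date -u                                                  # UTC wall clock
command -v timedatectl >/dev/null && timedatectl || true # zone + NTP sync status
command -v chronyc >/dev/null && chronyc tracking || true # offset from sources

# To align a node's zone, e.g.:
#   timedatectl set-timezone Pacific/Auckland
```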
regards
Steven
________________________________
From: Chris Samuel via slurm-users <slurm-users@lists.schedmd.com>
Sent: Tuesday, 10 December 2024 4:19 pm
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: node3 not working - down
On 9/12/24 5:44 pm, Steven Jones via slurm-users wrote:
[2024-12-09T23:38:56.645] error: Munge decode failed: Rewound credential
[2024-12-09T23:38:56.645] auth/munge: _print_cred: ENCODED: Tue Dec 10 23:38:30 2024
[2024-12-09T23:38:56.645] auth/munge: _print_cred: DECODED: Mon Dec 09 23:38:56 2024
[2024-12-09T23:38:56.645] error: Check for out of sync clocks
One system is 24 hours behind/ahead of the other.
You should make sure NTP is set up and working on all these nodes.
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com