Hi,

As suggested:

8><---
Stop their services, start them manually one by one (ctld first), then watch whether they talk to each other, and if they don't, learn what stops them from doing so - then iterate editing the config, "scontrol reconfig", lather, rinse, repeat.
8><---

Error logs on node3:

========
[2024-12-09T21:43:10.694] slurmd started on Mon, 09 Dec 2024 21:43:10 +0000
[2024-12-09T21:43:10.694] CPUs=20 Boards=1 Sockets=20 Cores=1 Threads=1 Memory=48269 TmpDisk=23308 Uptime=3703 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2024-12-09T23:38:56.645] error: Munge decode failed: Rewound credential
[2024-12-09T23:38:56.645] auth/munge: _print_cred: ENCODED: Tue Dec 10 23:38:30 2024
[2024-12-09T23:38:56.645] auth/munge: _print_cred: DECODED: Mon Dec 09 23:38:56 2024
[2024-12-09T23:38:56.645] error: Check for out of sync clocks
[2024-12-09T23:38:56.645] error: slurm_receive_msg_and_forward: g_slurm_auth_verify: REQUEST_PING has authentication error: Invalid authentication credential
[2024-12-09T23:38:56.645] error: slurm_receive_msg_and_forward: Protocol authentication error
[2024-12-09T23:38:56.655] error: service_connection: slurm_receive_msg: Protocol authentication error
[2024-12-10T01:13:18.454] Slurmd shutdown completing
======

I have checked the time and it is fine. As for munge.key, I ran md5sum over it and got the same output everywhere. When I reboot the Warewulf nodes it is a common container file, so nodes 1 and 2 working but not node 3 is just weird.

On the controller:

=========
[2024-12-10T01:30:35.108] Node node2 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.109] Node node1 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.109] Node node6 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.109] Node node7 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.110] Node node3 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.110] Node node5 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:35.110] Node node4 appears to have a different version of Slurm than ours. Please update at your earliest convenience.
[2024-12-10T01:30:36.104] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
==========

So the controller is RHEL9 and the nodes are Rocky 8; however, nodes 1-2 and 4-7 work OK.

When I run:

[root@vuwunicoslurmd1 log]# scontrol show node node3
NodeName=node3 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUEfctv=20 CPUTot=20 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node3 NodeHostName=node3 Version=20.11.9
   OS=Linux 4.18.0-553.30.1.el8_10.x86_64 #1 SMP Tue Nov 26 18:56:25 UTC 2024
   RealMemory=48000 AllocMem=0 FreeMem=43378 Sockets=20 Boards=1
   State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2024-12-09T20:41:27 SlurmdStartTime=2024-12-10T01:13:25
   LastBusyTime=2024-12-10T01:30:31
   CfgTRES=cpu=20,mem=48000M,billing=20
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Node unexpectedly rebooted [slurm@2024-12-09T20:40:27]
[root@vuwunicoslurmd1 log]#

I get a reason; it's like the node is held down/off. Does this need to be cleared, and can it be?

regards

Steven
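[For the archive: a node that Slurm has marked DOWN with a reason does stay down until the state is cleared. A minimal sketch of doing that from the controller, once the underlying problem is fixed, using the node name from the output above:]

```shell
# List every down/drained node together with the reason Slurm recorded.
sinfo -R

# Clear the DOWN state on node3 and return it to service.
# (Alternatively, ReturnToService=2 in slurm.conf lets nodes marked DOWN
# after an unexpected reboot rejoin automatically once they register with
# a valid configuration.)
scontrol update NodeName=node3 State=RESUME
```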
On 9/12/24 5:44 pm, Steven Jones via slurm-users wrote:
[2024-12-09T23:38:56.645] error: Munge decode failed: Rewound credential
[2024-12-09T23:38:56.645] auth/munge: _print_cred: ENCODED: Tue Dec 10 23:38:30 2024
[2024-12-09T23:38:56.645] auth/munge: _print_cred: DECODED: Mon Dec 09 23:38:56 2024
[2024-12-09T23:38:56.645] error: Check for out of sync clocks
One system is 24 hours behind/ahead of the other. You should make sure NTP is set up and working on all these nodes.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
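[The one-day offset can be read straight off the ENCODED/DECODED timestamps in the log above; a quick check with GNU date:]

```shell
# Seconds between when the credential was encoded on the sending side and
# when it was decoded on node3, per the two timestamps in the munge error.
encoded=$(date -u -d "Tue Dec 10 23:38:30 2024" +%s)
decoded=$(date -u -d "Mon Dec 09 23:38:56 2024" +%s)
echo "skew: $(( encoded - decoded )) seconds"   # just short of 86400 s = one day
```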
Thanks,

The times were correct via chrony, but the timezones were UTC and NZDT, which was the issue. Oddly, nodes 1 and 2 didn't care about that, only node3 ***shrug***

regards

Steven

________________________________
From: Chris Samuel via slurm-users <slurm-users@lists.schedmd.com>
Sent: Tuesday, 10 December 2024 4:19 pm
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Re: node3 not working - down

On 9/12/24 5:44 pm, Steven Jones via slurm-users wrote:
[2024-12-09T23:38:56.645] error: Munge decode failed: Rewound credential
[2024-12-09T23:38:56.645] auth/munge: _print_cred: ENCODED: Tue Dec 10 23:38:30 2024
[2024-12-09T23:38:56.645] auth/munge: _print_cred: DECODED: Mon Dec 09 23:38:56 2024
[2024-12-09T23:38:56.645] error: Check for out of sync clocks
One system is 24 hours behind/ahead of the other. You should make sure NTP is set up and working on all these nodes.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
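[A quick way to spot the UTC-vs-NZDT mismatch described in the resolution above, runnable on each node; Pacific/Auckland is used here purely as an example zone for New Zealand time:]

```shell
# Print the clock in the node's local zone and in UTC. If two nodes agree
# on the UTC time but show different zones/offsets, chrony is fine and the
# timezone configuration is the culprit.
echo "local: $(date '+%Y-%m-%d %H:%M %Z %z')"
echo "utc:   $(date -u '+%Y-%m-%d %H:%M %Z %z')"

# What a node set to New Zealand time shows for a fixed UTC instant
# (NZDT is UTC+13 during the southern summer):
TZ=Pacific/Auckland date -d '2024-12-10 12:00 UTC' '+%Y-%m-%d %H:%M %Z %z'
```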
participants (2)
- Chris Samuel
- Steven Jones