Nodes still down, so it wasn't time skew.
I have run the tests as per the munge docs and it all looks OK.
[root@node1 ~]# munge -n | unmunge | grep STATUS
[root@node1 ~]# ssh admjonesst1@vuw.ac.nz@vuwunicoslurmd1.ods.vuw.ac.nz munge -n -t 10 | unmunge
[root@node1 ~]# munge -n -t 10 | ssh admjonesst1@vuw.ac.nz@vuwunicoslurmd1.ods.vuw.ac.nz unmunge
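(For comparison, a healthy credential decodes with a line like
"STATUS: Success (0)" in the unmunge output; anything else there points at
the munge setup or key.)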
Hi,
On Sun, 2024-12-08 at 21:57:11 +0000, Slurm users wrote:
> I have just rebuilt all my nodes and I see
Did they ever work before with Slurm? (Which version?)
> Only 1 & 2 seem available?
> While 3~6 are not
Either you didn't wait long enough (5 minutes should be sufficient),
or the "down*" nodes don't have a slurmd that talks to the slurmctld.
The reasons for the latter can only be speculated about.
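If it isn't just timing, the controller usually records why it set a node
down; a quick look (node names taken from your sinfo output):

  scontrol show node node3 | grep -i reason

and, if slurmd on the node is actually up and the cause was transient:

  scontrol update nodename=node[3-6] state=resume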
> 3's log,
>
> [root@node3 log]# tail slurmd.log
> [2024-12-08T21:45:51.250] CPU frequency setting not configured for this node
> [2024-12-08T21:45:51.251] slurmd version 20.11.9 started
> [2024-12-08T21:45:51.252] slurmd started on Sun, 08 Dec 2024 21:45:51 +0000
> [2024-12-08T21:45:51.252] CPUs=20 Boards=1 Sockets=20 Cores=1 Threads=1 Memory=48269 TmpDisk=23324 Uptime=30 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
Does this match (exceed, for Memory and TmpDisk) the node declaration
known by the slurmctld?
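An easy way to compare the two (a sketch, host name taken from your log):

  slurmd -C                  # on node3: prints the hardware in slurm.conf syntax
  scontrol show node node3   # what the slurmctld thinks the node has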
> And 7 doesn't want to talk to the controller.
>
> [root@node7 slurm]# sinfo
> slurm_load_partitions: Zero Bytes were transmitted or received
Does it have munge running, with the right key?
I've seen this message when authorization was lost.
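A minimal check on node7 (assuming the default key location):

  systemctl status munge
  md5sum /etc/munge/munge.key   # must match the controller's key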
> These are all rebuilt and 1~3 are identical and 4~7 are identical.
Are the node declarations also identical, respectively?
Do they show the same features in slurmd.log?
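For example (paths are the usual defaults, adjust to your install):

  grep -i '^NodeName' /etc/slurm/slurm.conf
  grep 'CPUs=' slurmd.log       # on each node, the line you quoted above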
> [root@vuwunicoslurmd1 slurm]# sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 2 idle* node[1-2]
> debug* up infinite 4 down* node[3-6]
What you see here is what the slurmctld sees.
The usual way to debug this is to run the daemons that don't cooperate in
debug mode: stop their services, start them manually one by one (slurmctld
first), then watch whether they talk to each other. If they don't, find out
what stops them, edit the config, "scontrol reconfig", lather, rinse, repeat.
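A sketch of that, on the controller and on one of the non-working nodes
(service names and verbosity levels are the usual ones, adjust to taste):

  systemctl stop slurmctld && slurmctld -D -vvv    # on vuwunicoslurmd1
  systemctl stop slurmd && slurmd -D -vvv          # on node3

Both stay in the foreground and log to the terminal, so you see immediately
whether the node registers with the controller.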
You're the only one who knows your node configuration lines (NodeName=...),
so we can't help any further. Ole's pages perhaps can.
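For reference only, a line matching the slurmd.log you quoted might look
roughly like this (values are illustrative, not taken from your config;
RealMemory is kept a bit below the 48269 MB the node reports):

  NodeName=node[1-3] CPUs=20 Sockets=20 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=48000 State=UNKNOWN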
Best,
S
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~