Thanks. I was testing on the master node without connecting to the compute nodes.

From: Ole Holm Nielsen via slurm-users <slurm-users@lists.schedmd.com>
Sent: Friday, August 1, 2025 1:54:42 AM
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Re: slurmctld failed to start
 
Hi Nilesh,

It seems that your Munge setup isn't working.  Maybe the munge.key file
isn't shared on all nodes?
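
A quick local check can narrow this down before looking at the other
nodes. The sketch below is only an illustration: the helper name
find_munge_socket and the SOCKET_DIR variable are hypothetical, and
/run/munge is simply the directory named in the slurmctld error. It
lists any UNIX socket in that directory and, if the munge client tools
are on the PATH, runs a local encode/decode round trip. Since munged
here runs from /usr/local/sbin, its socket may live under a different
prefix than the one slurmctld tries; that is an assumption to verify,
not a confirmed diagnosis.

```shell
#!/bin/sh
# Hypothetical helper: print any UNIX-domain sockets found directly
# inside the given directory (default: /run/munge, from the error log).
find_munge_socket() {
    dir="${1:-/run/munge}"
    find "$dir" -maxdepth 1 -type s 2>/dev/null
}

# SOCKET_DIR is an assumed override for installs under another prefix.
sock=$(find_munge_socket "${SOCKET_DIR:-/run/munge}")
if [ -n "$sock" ]; then
    echo "socket found: $sock"
else
    echo "no socket under ${SOCKET_DIR:-/run/munge}"
fi

# Local round trip: encode a credential and decode it again.
# Skipped when the munge client tools are not installed.
if command -v munge >/dev/null 2>&1 && command -v unmunge >/dev/null 2>&1; then
    munge -n | unmunge || echo "munge round trip failed"
fi
```

If no socket shows up in that directory while munged is running,
comparing the path in the slurmctld error with the location munged
actually uses would be the next thing to check.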

I recommend that you take a look at this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/
to get a complete overview of the tasks involved in setting up a Slurm
cluster.

IHTH,
Ole

On 8/1/25 04:26, Dhumal, Dr. Nilesh via slurm-users wrote:
> Hello,
> We recently installed slurm-25 on Red Hat Linux,
> but the slurmctld service fails to start.
> sudo systemctl start slurmctld
> Job for slurmctld.service failed because the control process exited with
> error code.
> See "systemctl status slurmctld.service" and "journalctl -xeu
> slurmctld.service" for details.
>
> sudo systemctl status slurmctld
> × slurmctld.service - Slurm controller daemon
>       Loaded: loaded (/usr/local/lib/systemd/system/slurmctld.service;
> enabled; preset: disabled)
>       Active: failed (Result: exit-code) since Thu 2025-07-31 22:23:18
> EDT; 1min 3s ago
>      Process: 44317 ExecStart=/usr/local/sbin/slurmctld --systemd
> $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
>     Main PID: 44317 (code=exited, status=1/FAILURE)
>          CPU: 35ms
>
> Jul 31 22:22:29 fgcu-compute01 systemd[1]: Starting Slurm controller daemon...
> Jul 31 22:23:18 fgcu-compute01 slurmctld[44317]: [2025-07-31T22:23:18.679]
> error: If munged is up, restart with --num-threads=10
> Jul 31 22:23:18 fgcu-compute01 slurmctld[44317]: [2025-07-31T22:23:18.679]
> error: Munge encode failed: Failed to connect to "/run/munge/mung>
> Jul 31 22:23:18 fgcu-compute01 slurmctld[44317]: [2025-07-31T22:23:18.679]
> error: Failed to create MUNGE Credential
> Jul 31 22:23:18 fgcu-compute01 slurmctld[44317]: [2025-07-31T22:23:18.679]
> error: Couldn't load specified plugin name for auth/munge: Plugin>
> Jul 31 22:23:18 fgcu-compute01 slurmctld[44317]: [2025-07-31T22:23:18.679]
> error: cannot create auth context for auth/munge
> Jul 31 22:23:18 fgcu-compute01 slurmctld[44317]: [2025-07-31T22:23:18.679]
> fatal: failed to initialize auth plugin
> Jul 31 22:23:18 fgcu-compute01 systemd[1]: slurmctld.service: Main process
> exited, code=exited, status=1/FAILURE
> Jul 31 22:23:18 fgcu-compute01 systemd[1]: slurmctld.service: Failed with
> result 'exit-code'.
> Jul 31 22:23:18 fgcu-compute01 systemd[1]: Failed to start Slurm
> controller daemon.
>
> Here is the munge service status:
> munge.service - MUNGE authentication service
>       Loaded: loaded (/usr/local/lib/systemd/system/munge.service;
> enabled; preset: disabled)
>       Active: active (running) since Thu 2025-07-31 22:06:14 EDT; 19min ago
>         Docs: man:munged(8)
>     Main PID: 44039 (munged)
>        Tasks: 4 (limit: 606218)
>       Memory: 1.4M
>          CPU: 18ms
>       CGroup: /system.slice/munge.service
>               └─44039 /usr/local/sbin/munged
>
> Jul 31 22:06:14 fgcu-compute01 systemd[1]: Starting MUNGE authentication
> service...
> Jul 31 22:06:14 fgcu-compute01 systemd[1]: Started MUNGE authentication
> service.
>
> Any suggestions for resolving this issue would be appreciated.

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com