Hello,
We recently installed Slurm 25 on Red Hat Linux, but we are unable to start the slurmctld service:
sudo systemctl start slurmctld
Job for slurmctld.service failed because the control process exited with error code.
See "systemctl status slurmctld.service" and "journalctl -xeu slurmctld.service" for details.
sudo systemctl status slurmctld
× slurmctld.service - Slurm controller daemon
     Loaded: loaded (/usr/local/lib/systemd/system/slurmctld.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Thu 2025-07-31 22:23:18 EDT; 1min 3s ago
    Process: 44317 ExecStart=/usr/local/sbin/slurmctld --systemd $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 44317 (code=exited, status=1/FAILURE)
        CPU: 35ms
Jul 31 22:22:29 fgcu-compute01 systemd[1]: Starting Slurm controller daemon...
Jul 31 22:23:18 fgcu-compute01 slurmctld[44317]: [2025-07-31T22:23:18.679] error: If munged is up, restart with --num-threads=10
Jul 31 22:23:18 fgcu-compute01 slurmctld[44317]: [2025-07-31T22:23:18.679] error: Munge encode failed: Failed to connect to "/run/munge/mung>
Jul 31 22:23:18 fgcu-compute01 slurmctld[44317]: [2025-07-31T22:23:18.679] error: Failed to create MUNGE Credential
Jul 31 22:23:18 fgcu-compute01 slurmctld[44317]: [2025-07-31T22:23:18.679] error: Couldn't load specified plugin name for auth/munge: Plugin>
Jul 31 22:23:18 fgcu-compute01 slurmctld[44317]: [2025-07-31T22:23:18.679] error: cannot create auth context for auth/munge
Jul 31 22:23:18 fgcu-compute01 slurmctld[44317]: [2025-07-31T22:23:18.679] fatal: failed to initialize auth plugin
Jul 31 22:23:18 fgcu-compute01 systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Jul 31 22:23:18 fgcu-compute01 systemd[1]: slurmctld.service: Failed with result 'exit-code'.
Jul 31 22:23:18 fgcu-compute01 systemd[1]: Failed to start Slurm controller daemon.
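The ">" at the end of two of these journal lines most likely means the pager chopped them, so the full MUNGE socket path in the error is not visible. A minimal sketch for getting the complete messages, assuming a root shell on the controller and the /usr/local install paths shown in the status output: print the journal without a pager, or run slurmctld in the foreground (-D) with extra verbosity (-vvv). Both helper function names are hypothetical.

```shell
# Hypothetical helper: print the last N (default 50) slurmctld journal
# entries in full; --no-pager avoids the pager chopping long lines.
show_slurmctld_log() {
  journalctl -u slurmctld.service --no-pager -n "${1:-50}"
}

# Hypothetical helper: run the controller in the foreground (-D) with
# extra verbosity (-vvv) so the auth plugin error is printed to the
# terminal. Stop it with Ctrl-C. Binary path taken from ExecStart above.
run_slurmctld_foreground() {
  /usr/local/sbin/slurmctld -D -vvv
}
```

For example, `show_slurmctld_log 100` should reveal the untruncated "Failed to connect to" line.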
Here is the munge service status:
munge.service - MUNGE authentication service
     Loaded: loaded (/usr/local/lib/systemd/system/munge.service; enabled; preset: disabled)
     Active: active (running) since Thu 2025-07-31 22:06:14 EDT; 19min ago
       Docs: man:munged(8)
   Main PID: 44039 (munged)
      Tasks: 4 (limit: 606218)
     Memory: 1.4M
        CPU: 18ms
     CGroup: /system.slice/munge.service
             └─44039 /usr/local/sbin/munged
Jul 31 22:06:14 fgcu-compute01 systemd[1]: Starting MUNGE authentication service...
Jul 31 22:06:14 fgcu-compute01 systemd[1]: Started MUNGE authentication service.
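munged is reported as active, yet slurmctld says it failed to connect to a socket under /run/munge (the file name is cut off in the log). One thing worth checking is whether the socket munged actually listens on is the one slurmctld is looking for. A minimal sketch, assuming a root shell; the helper names are hypothetical, and the default directory is simply the one visible in the truncated error, not a confirmed path.

```shell
# Hypothetical helper: show the UNIX sockets the running munged process
# is listening on (-x UNIX, -l listening, -p owning process; -p needs
# root to see processes owned by the munge user).
munged_sockets() {
  ss -xlp | grep munged
}

# Hypothetical helper: list socket files in a directory, defaulting to
# /run/munge, the directory named in the slurmctld error.
# Returns 1 if the directory does not exist.
sockets_in_dir() {
  local dir="${1:-/run/munge}"
  if [ ! -d "$dir" ]; then
    echo "missing directory: $dir" >&2
    return 1
  fi
  find "$dir" -maxdepth 1 -type s
}
```

If `munged_sockets` reports a path outside the directory that `sockets_in_dir` inspects, the two programs disagree on where the socket lives. The munge and unmunge clients accept `-S`/`--socket` for testing against a specific socket path.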
Any suggestions for resolving this issue would be appreciated.
Thanks,
Nilesh Dhumal
Associate Professor of Chemistry,
http://faculty.fgcu.edu/ndhumal/
Coordinator, FGCU Computational Facility,
https://www.fgcu.edu/cas/facultyresources/computationalfacility/
SH-430; Department of Chemistry and Physics
Florida Gulf Coast University
10501 FGCU Boulevard South
Fort Myers, FL 33965-6565
Phone: (239) 745-4394
Email: ndhumal@fgcu.edu
Hi Nilesh,
It seems that your Munge setup isn't working. Maybe the munge.key file isn't shared on all nodes?
I recommend that you take a look at this Wiki page: https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/ for a complete overview of the tasks involved in setting up a Slurm cluster.
IHTH, Ole
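Ole's suggestion that munge.key may not be shared on all nodes can be checked by comparing the key's checksum on every host. A minimal sketch, assuming root SSH access to each host and the key at /etc/munge/munge.key; a MUNGE built from source with a /usr/local prefix may keep it under /usr/local/etc/munge instead. The function name and the MUNGE_KEY override variable are hypothetical.

```shell
# Hypothetical helper: print the SHA-256 hash of the MUNGE key on each
# host given as an argument; all hashes must match for credentials
# encoded on one node to decode on another. Assumes root SSH access.
# MUNGE_KEY is a hypothetical override for the key path.
compare_munge_keys() {
  local key="${MUNGE_KEY:-/etc/munge/munge.key}"
  local host
  for host in "$@"; do
    printf '%s: ' "$host"
    ssh "root@$host" sha256sum "$key" | cut -d' ' -f1
  done
}
```

For example, `compare_munge_keys fgcu-compute01 node01 node02`, where node01 and node02 are placeholder hostnames.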
Thanks. I was testing on the master node without connecting to the compute nodes.
From: Ole Holm Nielsen via slurm-users <slurm-users@lists.schedmd.com>
Sent: Friday, August 1, 2025 1:54:42 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: slurmctld failed to start
-- slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
On 8/1/25 09:34, Dhumal, Dr. Nilesh wrote:
Thanks. I was testing on the master node without connecting to the compute nodes.
So you need to test your Munge setup; see https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#munge-configur...
/Ole
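Testing a Munge setup comes down to encoding a credential and decoding it again, first on the same host and then on another node. A minimal sketch, assuming the munge and unmunge client programs are on the PATH; the helper names and any remote hostname are placeholders, not part of the original thread.

```shell
# Hypothetical helper: encode a credential with no payload (-n) and
# decode it on the same host. On success unmunge exits 0 and prints a
# STATUS line reading "Success (0)".
munge_local_test() {
  munge -n | unmunge
}

# Hypothetical helper: encode locally and decode on the node given as
# the first argument; success means both hosts share the same munge.key.
munge_remote_test() {
  munge -n | ssh "$1" unmunge
}
```

With only the master node in play, `munge_local_test` is the relevant check; `munge_remote_test node01` (placeholder hostname) becomes meaningful once compute nodes are connected.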