Hey everyone,
In the past we set up clusters with a copy of the config on each node. Now we want to explore configless mode. Without changing anything else, we followed https://slurm.schedmd.com/configless_slurm.html and added 'enable_configless' to the config on the master:
SlurmctldParameters=cloud_dns,idle_on_node_suspend,enable_configless,reconfig_on_restart
and started each worker's slurmd with the --conf-server parameter:
# Override systemd service to set conditional path
[Service]
ExecStart=
ExecStart=/usr/sbin/slurmd --conf-server=master
However, this leads to:
slurmd: error: _fetch_child: failed to fetch remote configs: Protocol authentication error
slurmd: error: _establish_configuration: failed to load configs. Retrying in 10 seconds.
on the workers and on the master (/var/log/slurm/slurmctld) to:
[2026-01-16T10:00:06.681] error: Munge decode failed: Invalid credential
[2026-01-16T10:00:06.681] auth/munge: _print_cred: ENCODED: Thu Jan 01 00:00:00 1970
[2026-01-16T10:00:06.681] auth/munge: _print_cred: DECODED: Thu Jan 01 00:00:00 1970
[2026-01-16T10:00:06.681] error: slurm_unpack_received_msg: [[worker]:24295] auth_g_verify: REQUEST_CONFIG has authentication error: Unspecified error
[2026-01-16T10:00:06.681] error: slurm_unpack_received_msg: [[worker]:24295] Protocol authentication error
The munge key setup is the same as before, so I don't think there is anything wrong with it, unless something changes with configless (slurm.conf):
AuthType=auth/munge
CryptoType=crypto/munge
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/etc/slurm/jwt-secret.key
I found https://groups.google.com/g/slurm-users/c/Q7FVkhx-bOs but this seems unrelated, as the worker and the master can talk to each other just fine:
worker:~$ nc -zv master 6817
Connection to master (192.168.20.169) 6817 port [tcp/*] succeeded!
I tried adding more "-v" flags to the slurmd start, but that did not give any more information. I am unsure how to debug this further. Somehow I suspect it must be a munge issue, but I am confused because this part hasn't changed.
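One check I still plan to run is whether a credential encoded on the worker actually decodes on the master (just a sketch; it assumes the standard munge client tools are installed on both hosts and that "master" resolves as above):

worker:~$ munge -n | ssh master unmunge    # should end with STATUS: Success (0)
worker:~$ date -u && ssh master date -u    # munge also rejects credentials on large clock skew
# ...and compare the checksum of /etc/munge/munge.key on both hosts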
Best regards, Xaver
Hi Xaver,
We have been running Configless Slurm for a number of years, and we're very happy with this setup. I have documented all the detailed configuration we made in this Wiki page, which you may want to consult:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#configless-slurm-setup
IHTH, Ole
Hey Ole,
thank you so much for your detailed documentation, which leaves me with both answers and questions. Apparently the aforementioned error had nothing to do with munge but with some issue around reloading slurmd which I can't really reproduce. I think I somehow had two instances running and only killed one, but this is difficult to tell, because once I redid the entire setup, half the issue disappeared.
The remaining issue is that slurmd can't start via systemctl, because slurmd never notifies systemd that it is ready. I was able to fix this by setting:
[Service]
Type=simple
which allows the start; Slurm is then able to reach the node, config files are pulled as expected, and I can schedule commands on the node.
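For reference, the complete drop-in now looks roughly like this (only the master hostname is specific to our setup; the comments are mine):

# /etc/systemd/system/slurmd.service.d/override.conf
[Service]
# The packaged unit waits for a readiness notification that never arrives here,
# so fall back to a plain service type.
Type=simple
# Clear the packaged ExecStart before replacing it.
ExecStart=
ExecStart=/usr/sbin/slurmd --conf-server=master

applied with 'systemctl daemon-reload' followed by 'systemctl restart slurmd'. Note that without -D slurmd still forks into the background, so systemd may treat the main process as exited even though slurmd keeps running; that probably explains the status output below.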
While this leaves me with a running system, I still get:
ubuntu@worker:~$ systemctl status slurmd.service
○ slurmd.service - Slurm node daemon
     Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/slurmd.service.d
             └─override.conf
     Active: inactive (dead) since Mon 2026-01-19 13:31:28 UTC; 8min ago
   Duration: 7ms
    Process: 19712 ExecStart=/usr/sbin/slurmd --conf-server=master (code=exited, status=0/SUCCESS)
   Main PID: 19712 (code=exited, status=0/SUCCESS)
      Tasks: 11 (limit: 19147)
     Memory: 4.2M (peak: 6.4M)
        CPU: 110ms
     CGroup: /system.slice/slurmd.service
             └─19714 /usr/sbin/slurmd --conf-server=master
Jan 19 13:31:28 worker systemd[1]: Started slurmd.service - Slurm node daemon.
Jan 19 13:31:28 worker systemd[1]: slurmd.service: Deactivated successfully.
Jan 19 13:31:28 worker systemd[1]: slurmd.service: Unit process 19713 (slurmd) remains running after unit stopped.
Jan 19 13:31:28 worker systemd[1]: slurmd.service: Unit process 19714 (slurmd) remains running after unit stopped.
Jan 19 13:31:28 worker slurmd[19716]: error: _fetch_child: failed to fetch remote configs: Protocol authentication error
Jan 19 13:31:28 worker slurmd[19714]: error: _establish_configuration: failed to load configs. Retrying in 10 seconds.
This leaves me with the guess that the initial failure, which then succeeds on retry, might cause systemd to give up early. Note that we set up our Slurm cluster via Ansible scripts, so there might also be a race condition I am overlooking that causes parts of the authentication setup not to be ready yet; however, this was not an issue before we tried configless.
Best, Xaver
Hi Xaver,
I have no experience with Ubuntu systems, which may behave differently from our RockyLinux 8 systems. Setting up Slurm with Ansible should be fine; this is also how we configure our Slurm servers and login nodes (but not the slurmd nodes). Once Ansible has finished, the system ought to work.
Did you build your Slurm packages with the Debian build system? See https://slurm.schedmd.com/quickstart_admin.html#debuild
Do you run a recent Slurm version (24.11 and later are currently supported)?
I wonder if the error:
error: _fetch_child: failed to fetch remote configs: Protocol authentication error
is due to the network not yet being up after reboot? Restarting slurmd manually should hopefully work.
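If the network does turn out to be the culprit, one thing to try (just a sketch of the usual systemd ordering trick, untested on Ubuntu here) is to make slurmd wait until the network is fully online and munge is up:

# additional drop-in for slurmd.service, e.g. /etc/systemd/system/slurmd.service.d/ordering.conf
[Unit]
Wants=network-online.target
After=network-online.target munge.service

together with the distribution's *-wait-online.service enabled.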
IHTH, Ole
Hey Ole,
we currently use Slurm 24.11. Regarding the build process, I have to get in touch with our cloud admins, as we build Slurm ourselves and then offer it via a mirror. However, I can confirm that it all worked without errors before using configless.
The network is definitely already up, as restarting slurmd does not help. However, I noticed that in these cases slurmd fails VERY quickly; it definitely does not wait for any timeout.
I primarily mentioned Ansible to explain why I am pretty sure that the system is set up the same way as before we tried configless.
Best, Xaver
Hi Xaver,
Are you sure that your DNS SRV record is responding?
$ dig +short +search +ndots=2 -t SRV -n _slurmctld._tcp
See https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#testing-confi...
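For comparison, the SRV record we publish looks roughly like this (zone and host names replaced by placeholders, 6817 being the default SlurmctldPort):

_slurmctld._tcp.example.com.  3600  IN  SRV  10 0 6817  master.example.com.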
Best regards, Ole
Hey Ole,
I apologize for the late reply.
I receive nothing from `dig +short +search +ndots=2 -t SRV -n _slurmctld._tcp`, but shouldn't setting
cat /etc/systemd/system/slurmd.service.d/override.conf
# Override systemd service to set conditional path
# Type=simple
[Service]
ExecStart=
ExecStart=/usr/sbin/slurmd --conf-server=master
be enough given the documentation?
The --conf-server option takes precedence over the DNS record.
But I think that you are right, and somehow Slurm ignores the --conf-server address when Type is not simple. I am confused.
nc -vz master 6817
Connection to master (192.168.20.41) 6817 port [tcp/*] succeeded!
works, and starting it later, either on the command line or via Type=simple, works too. I will try to dig deeper into the logs to see whether the parameter gets skipped somehow, but I would still appreciate any help.
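One thing I intend to try next (a sketch; -D keeps slurmd in the foreground and repeating -v raises the log level) is starting it by hand exactly as the unit does and watching which config source it picks:

worker:~$ sudo /usr/sbin/slurmd -D -vvv --conf-server=master

If that pulls the configs while the systemd-started unit does not, the difference is probably in the unit's environment at boot (ordering, munge socket availability) rather than in the --conf-server handling itself.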
Best, Xaver