Update:
We have solved the issue.

Our problem was that even though we run a configless setup, our provisioning was serving an unconfigured slurm.conf file to /etc/slurm on the nodes.

On the failing nodes, we could see:
scontrol show config | grep -i "hash_val"
cn080: HASH_VAL                = Different Ours=<...> Slurmctld=<...>

While on working nodes we saw:
scontrol show config | grep -i "hash_val"
cn044: HASH_VAL                = Match

Note: The failing nodes could still get jobs scheduled via sbatch. The issue was with srun/salloc.

We removed the slurm.conf file, restarted the services, and for now everything works fine.
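
For anyone hitting the same symptom, the cleanup amounted to roughly the following on each affected node (the path and service name are assumptions from a standard layout; adjust to your packaging):

rm /etc/slurm/slurm.conf                 # drop the stale, unconfigured file
systemctl restart slurmd                 # let slurmd pull its config from slurmctld again
scontrol show config | grep -i hash_val  # should now report "Match"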

Thanks for the support.

Bruno Bruzzo
System Administrator - Clementina XXI

On Wed, Sep 24, 2025 at 3:51 PM, John Hearns (hearnsj@gmail.com) wrote:
Shot down in 🔥🔥

On Wed, Sep 24, 2025, 7:43 PM Bruno Bruzzo <bbruzzo@dc.uba.ar> wrote:
Yes, all nodes are synchronized with chrony.

On Wed, Sep 24, 2025 at 3:28 PM, John Hearns (hearnsj@gmail.com) wrote:
Err, are all your nodes on the same time?

Actually, slurmd will not start if a compute node is too far off in time from the controller node, so you should be OK.

I would still check that the times on all nodes are in agreement.

On Wed, Sep 24, 2025, 7:19 PM Bruno Bruzzo via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hi, sorry for the late reply.

We tested your proposal and can confirm that all nodes have entries for each other in their respective /etc/hosts files. We can also confirm that the slurmd port is not blocked.
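
For reference, this kind of check can be scripted roughly like this (node names are placeholders; 6818 is the default SlurmdPort, adjust to whatever your slurm.conf sets):

for n in <node x> <node y>; do
    getent hosts "$n"      # name resolution as seen from this host
    nc -zvw2 "$n" 6818     # TCP reachability of the slurmd port
done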

One thing we found useful for reproducing the issue: if we run srun -w <node x> and, in another session, run srun -w <node x> again, the second srun waits for resources while the first one gets onto <node x>. If we then exit the first session, the waiting srun fails with error: security violation/invalid job credentials instead of getting onto <node x>.
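
In shell terms the reproduction looks roughly like this (the --pty bash command is just a stand-in for what we actually ran):

# terminal 1: gets a shell on the node
srun -w <node x> --pty bash
# terminal 2: waits for resources while terminal 1 holds the node
srun -w <node x> --pty bash
# exit terminal 1; instead of starting, terminal 2 then fails with
# the "security violation/invalid job credentials" error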

We also found that scontrol ping fails not only on the login node but also on the nodes of a specific partition, where it shows the longer message:

Slurmctld(primary) at <headnode> is DOWN
*****************************************
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*****************************************

Still, Slurm is able to assign jobs to those nodes.

We also raised the debug level on slurmctld to the maximum, and when running scontrol ping we get this log:
[2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969 
[2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969 
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] auth_g_verify: REQUEST_PING has authentication error: Unspecified error 
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] Protocol authentication error 
[2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55274]: Protocol authentication error 
[2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202 
[2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] Protocol authentication error
[2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55286]: Protocol authentication error
[2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202

I find it suspicious that the date munge shows is Wed Dec 31 21:00:00 1969 (Unix epoch 0 rendered in UTC-3, i.e., an unset timestamp). I checked that munge.key has the correct ownership and that all nodes have the same file.
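
For anyone else chasing this, a quick way to cross-check munge between hosts with the standard tools is something like (node name is a placeholder):

munge -n | unmunge               # encode and decode locally
munge -n | ssh <node x> unmunge  # encode here, decode on another node
ssh <node x> munge -n | unmunge  # encode on another node, decode here
# A mismatched munge.key or large clock skew shows up as a decode error.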

Does anyone have more documentation on what scontrol ping does? We haven't found detailed information in the docs.

Best regards,
Bruno Bruzzo
System Administrator - Clementina XXI


On Fri, Aug 29, 2025 at 3:47 AM, Bjørn-Helge Mevik via slurm-users (slurm-users@lists.schedmd.com) wrote:
Bruno Bruzzo via slurm-users <slurm-users@lists.schedmd.com> writes:

> slurmctld runs on management node mmgt01.
> srun and salloc fail intermittently on the login node; that means
> we can successfully use srun on the login node from time to time, but it
> stops working for a while without us changing any configuration.

This, to me, sounds like there could be a problem on the compute nodes,
or in the communication between logins and computes.  One thing that has
bitten me several times over the years is compute nodes missing from
/etc/hosts on other compute nodes.  Slurmctld often sends messages
to computes via other computes, and if a message happens to go via a
node that does not have the target compute in its /etc/hosts, it cannot
forward the message.

Another thing to look out for is whether any nodes running slurmd
(computes or logins) have their slurmd port blocked by firewalld
or something else.

> scontrol ping always shows DOWN from login node, even when we can
> successfully
> run srun or salloc.

This might indicate that the slurmctld port on mmgt01 is blocked, or the
slurmd port on the logins.

It might be something completely different, but I'd at least check /etc/hosts
on all nodes (controller, logins, computes) and check that all needed
ports are unblocked.

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com
