Hi, sorry for the late reply.
We tested your proposal and can confirm that all nodes have each other in their respective /etc/hosts. We can also confirm that the slurmd port is not blocked.
One thing we found useful for reproducing the issue: if we run srun -w <node x> in one session and srun -w <node x> again in another session, the second srun waits for resources while the first one gets onto <node x>. If we then exit the first session, the srun that was waiting fails with "error: security violation/invalid job credentials" instead of getting onto <node x>.
We also found that scontrol ping not only fails on the login node, but also on the nodes of a specific partition, showing the larger message:
Slurmctld(primary) at <headnode> is DOWN
*****************************************
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*****************************************
Still, Slurm is able to assign those nodes to jobs.
We also raised the slurmctld debug level to the maximum, and when running scontrol ping we get this log:
[2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55274] Protocol authentication error
[2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55274]: Protocol authentication error
[2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
[2025-09-24T14:45:16] auth/munge: _print_cred: ENCODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] auth/munge: _print_cred: DECODED: Wed Dec 31 21:00:00 1969
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
[2025-09-24T14:45:16] error: slurm_unpack_received_msg: [[snmgt01]:55286] Protocol authentication error
[2025-09-24T14:45:16] error: slurm_receive_msg [172.28.253.11:55286]: Protocol authentication error
[2025-09-24T14:45:16] error: Munge decode failed: Unauthorized credential for client UID=202 GID=202
I find it suspicious that the date munge shows is Wed Dec 31 21:00:00 1969. I checked that munge.key has the correct ownership and that all nodes have the same file.
Does anyone have more documentation on what scontrol ping does? We haven't found detailed information in the docs.
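For what it's worth, that date is exactly what Unix timestamp 0 looks like when printed in a UTC-3 time zone, so it may just be an unset timestamp on a credential that failed to decode rather than a real clock reading. The sketch below only illustrates that arithmetic. The UTC-3 offset is an assumption (the log does not state the time zone), and format_epoch is a hypothetical helper, not part of Slurm or MUNGE.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the host's local time zone is UTC-3 (not stated in the log).
LOCAL_TZ = timezone(timedelta(hours=-3))


def format_epoch(seconds: int) -> str:
    """Render a Unix timestamp in LOCAL_TZ, in a ctime-like layout."""
    moment = datetime.fromtimestamp(seconds, tz=LOCAL_TZ)
    return moment.strftime("%a %b %d %H:%M:%S %Y")


# Timestamp 0 is 1970-01-01 00:00:00 UTC, i.e. three hours earlier in UTC-3.
print(format_epoch(0))
```

If that reading is right, the 1969 date would be a side effect of the failed decode and not by itself evidence of clock skew between the nodes.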
Best regards,
Bruno Bruzzo
System Administrator - Clementina XXI