Bruno Bruzzo via slurm-users <slurm-users@lists.schedmd.com> writes:
> slurmctld runs on management node mmgt01. srun and salloc fail intermittently on the login node; that is, we can successfully use srun on the login node from time to time, but it stops working for a while without us changing any configuration.
This, to me, sounds like there could be a problem on the compute nodes, or in the communication between logins and computes. One thing that has bitten me several times over the years is compute nodes missing from /etc/hosts on other compute nodes. Slurmctld often sends messages to computes via other computes, and if a message happens to go via a node that does not have the target compute in its /etc/hosts, that node cannot forward the message.
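A quick way to look for this is a small script run on every node (controller, logins, computes), since each node has its own /etc/hosts. The following is only a sketch: the function name is made up for illustration, and it goes through the system resolver, so a name that resolves via DNS but is missing from /etc/hosts will still count as resolvable.

```python
import socket


def unresolved_hosts(names, resolver=socket.gethostbyname):
    """Return the names that the local resolver cannot map to an address."""
    missing = []
    for name in names:
        try:
            resolver(name)
        except OSError:
            # socket.gaierror and socket.herror are both subclasses of OSError
            missing.append(name)
    return missing


# Example call, with a hypothetical node list (use the names from your slurm.conf):
#   unresolved_hosts(["mmgt01", "login01", "node001", "node002"])
```

Any name it returns on a given node is one that node could not forward messages to.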
Another thing to check is whether any nodes running slurmd (computes or logins) have their slurmd port blocked by firewalld or something else.
> scontrol ping always shows DOWN from login node, even when we can successfully run srun or salloc.
This might indicate that the slurmctld port on mmgt01 is blocked, or the slurmd port on the logins.
It might be something completely different, but I'd at least check /etc/hosts on all nodes (controller, logins, computes) and check that all needed ports are unblocked.
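For the port check, something like the sketch below can be run from the login node against mmgt01, and from the controller and computes against the nodes running slurmd. It assumes Slurm's default ports (6817 for slurmctld, 6818 for slurmd); if your slurm.conf sets SlurmctldPort or SlurmdPort, use those values instead. The function name is made up for illustration.

```python
import socket

# Slurm's default ports; a site may override them with
# SlurmctldPort and SlurmdPort in slurm.conf.
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False


# Example calls (the compute node name is hypothetical):
#   port_open("mmgt01", SLURMCTLD_PORT)   # from the login node
#   port_open("node001", SLURMD_PORT)     # from the controller or another compute
```

A False result only says the TCP connection failed; it does not tell you whether a firewall, a stopped daemon, or a bad hostname is the cause.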