[slurm-users] Re: SRUN and SBATCH network issues on configless login node.

29 Aug 2025


Bruno Bruzzo via slurm-users <slurm-users@lists.schedmd.com> writes:
...
slurmctld runs on management node mmgt01.
srun and salloc fail intermittently on login node, that means
we can successfully use srun on login node from time to time, but it
stops working for a while without us changing any configuration.
This, to me, sounds like there could be a problem on the compute nodes,
or in the communication between logins and computes.  One thing that has
bitten me several times over the years is compute nodes missing from
/etc/hosts on other compute nodes.  Slurmctld often sends messages
to computes via other computes, and if a message happens to go via a
node that does not have the target compute in its /etc/hosts, that node
cannot forward the message.
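
A check like this can be scripted and run on each node. The following is only a minimal sketch: the function names, the sample addresses and the node names (cn001, cn002) are made up for illustration, and it assumes the nodes resolve each other through /etc/hosts rather than DNS.

```python
def names_in_hosts(hosts_text):
    """Collect every hostname and alias from hosts-file formatted text."""
    names = set()
    for line in hosts_text.splitlines():
        entry = line.split("#", 1)[0].split()  # strip comment, split on whitespace
        # entry[0] is the IP address; the remaining fields are names for it
        names.update(name.lower() for name in entry[1:])
    return names


def missing_nodes(hosts_text, expected_nodes):
    """Return the expected node names that have no entry in the hosts text."""
    known = names_in_hosts(hosts_text)
    return sorted(node for node in expected_nodes if node.lower() not in known)


# Demo with made-up data; on a real node, read /etc/hosts instead.
sample_hosts = """\
127.0.0.1   localhost
10.0.0.1    mmgt01 mmgt01.cluster   # controller
10.0.0.11   cn001
"""
print(missing_nodes(sample_hosts, ["mmgt01", "cn001", "cn002"]))  # prints ['cn002']
```

On a real node the first argument would be the contents of /etc/hosts, and the expected list could be taken from something like "sinfo -N -h -o %N" plus the controller and login hostnames.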
Another thing to check is whether any nodes running slurmd (computes or
logins) have their slurmd port blocked by firewalld or something else.
...
scontrol ping always shows DOWN from login node, even when we can
successfully run srun or salloc.
This might indicate that the slurmctld port on mmgt01 is blocked, or the
slurmd port on the logins.
It might be something completely different, but I'd at least check /etc/hosts
on all nodes (controller, logins, computes) and check that all needed
ports are unblocked.
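
A quick way to test the ports is to try a plain TCP connection from the node that has the problem. This is a rough sketch, not a full diagnosis: the helper name is made up, and 6817 and 6818 are only the Slurm defaults for the slurmctld and slurmd ports, so use the SlurmctldPort and SlurmdPort values from your slurm.conf if they differ. A successful connection only shows that nothing blocks the port; it says nothing about authentication.

```python
import socket

SLURMCTLD_PORT = 6817  # Slurm default; assumption, check SlurmctldPort in slurm.conf
SLURMD_PORT = 6818     # Slurm default; assumption, check SlurmdPort in slurm.conf


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example calls (hostnames other than mmgt01 are placeholders):
#   port_reachable("mmgt01", SLURMCTLD_PORT)   # run on the login node
#   port_reachable("login01", SLURMD_PORT)     # run on mmgt01 or a compute
```

If the first call returns False on the login node while the same call works from a compute node, a firewall between the login node and mmgt01 is a likely suspect.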
-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
