Hi all, first of all, sorry for my English, it's not my native language.
We are currently experiencing an issue with srun and salloc on our login nodes, while sbatch works properly.
Slurm version 23.11.4.
slurmctld runs on the management node mmgt01. srun and salloc fail intermittently on the login node: we can successfully use srun from time to time, but it stops working for a while without any configuration change on our side.
The login node reports the following to the user:
$ srun -N 1 -n 64 --partition=gpunode --pty /bin/bash
srun: job 4872 queued and waiting for resources
srun: error: Security violation, slurm message from uid 202
srun: error: Security violation, slurm message from uid 202
srun: error: Security violation, slurm message from uid 202
srun: error: Task launch for StepId=4872.0 failed on node cn065: Invalid job credential
srun: error: Application launch failed: Invalid job credential
srun: Job step aborted
uid 202 is the slurm user.
On the server side, slurmctld logs show:
sched: _slurm_rpc_allocate_resources JobId=4872 NodeList=(null) usec=228
sched: Allocate JobId=4872 NodeList=cn065 #CPUs=64 Partition=gpunode
error: slurm_receive_msgs: [[snmgt01]:38727] failed: Zero Bytes were transmitted or received
Killing interactive JobId=4872: Communication connection failure
_job_complete: JobId=4872 WEXITSTATUS 1
_job_complete: JobId=4872 done
step_partial_comp: JobId=4872 StepID=0 invalid; this step may have already completed
_slurm_rpc_complete_job_allocation: JobId=4872 error Job/step already completing or completed
We suspect a network issue, given the "Zero Bytes were transmitted or received" error.
Configless mode is working properly: slurmd on the login node picks up changes made to slurm.conf after a scontrol reconfigure.
srun runs successfully from the management nodes and from the compute nodes; the issue only appears on the login node.
scontrol ping always shows DOWN from the login node, even when srun or salloc succeeds.
$ scontrol ping Slurmctld(primary) at mmgt01 is DOWN
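For what it is worth, a basic reachability check we can run from the login node (assuming the default SlurmctldPort 6817; the actual value is whatever scontrol show config reports) would be:

$ scontrol show config | grep -i SlurmctldPort   # confirm which port slurmctld listens on
$ nc -zv mmgt01 6817                             # plain TCP connect test from the login node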
We also checked MUNGE key consistency across the nodes.
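By "checked" we mean round-trip tests roughly along these lines (cn065 used here as an example compute node):

$ munge -n | unmunge              # local encode/decode on the login node
$ munge -n | ssh cn065 unmunge    # credential created on the login node, decoded on a compute node
$ ssh cn065 munge -n | unmunge    # same thing in the opposite direction

unmunge also prints ENCODE_TIME and DECODE_TIME, which is handy for spotting clock skew between nodes.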
The mmgt and login nodes have each other's hostnames in /etc/hosts and can communicate.
We would really appreciate some tips on what we could be missing.
Best regards,
Bruno Bruzzo
System Administrator - Clementina XXI
Bruno Bruzzo via slurm-users <slurm-users@lists.schedmd.com> writes:
slurmctld runs on the management node mmgt01. srun and salloc fail intermittently on the login node: we can successfully use srun from time to time, but it stops working for a while without any configuration change on our side.
This, to me, sounds like there could be a problem on the compute nodes, or in the communication between logins and computes. One thing that has bitten me several times over the years is compute nodes missing from /etc/hosts on other compute nodes. Slurmctld often sends messages to computes via other computes, and if a message happens to go via a node that does not have the target compute in its /etc/hosts, it cannot forward the message.
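A quick way to catch that (just a sketch; it assumes sinfo and getent are available on the node) is to have every compute try to resolve every other compute, for example:

# run on each compute node (e.g. via pdsh); lists every node known to Slurm and
# checks that its name resolves locally (via /etc/hosts or DNS, per nsswitch.conf)
$ for h in $(sinfo -h -N -o '%N' | sort -u); do
    getent hosts "$h" > /dev/null || echo "cannot resolve: $h"
  done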
Another thing to look out for is whether any node running slurmd (compute or login) has its slurmd port blocked by firewalld or something else.
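For example, on a compute or login node (assuming the default SlurmdPort 6818 and a firewalld-based setup; adjust to whatever you actually use):

$ scontrol show config | grep -i SlurmdPort          # confirm the slurmd port in use
$ ss -tlnp | grep slurmd                             # as root: is slurmd listening on that port?
$ firewall-cmd --state && firewall-cmd --list-all    # is firewalld active, and is that port allowed?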
scontrol ping always shows DOWN from the login node, even when srun or salloc succeeds.
This might indicate that the slurmctld port on mmgt01 is blocked, or the slurmd port on the logins.
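The slurmd side on the logins can be tested with a plain TCP check from mmgt01 and from a compute node (assuming the default SlurmdPort 6818 and a hypothetical login node name login01; the slurmctld direction can be checked the same way against the SlurmctldPort on mmgt01):

$ nc -zv login01 6818    # should connect if nothing is blocking slurmd on the login node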
It might be something completely different, but I'd at least check /etc/hosts on all nodes (controller, logins, computes) and check that all needed ports are unblocked.