Hi all, first of all, sorry for my English, it's not my native language.
We are currently experiencing an issue with srun and salloc on our
login nodes, while sbatch works properly.
Slurm version: 23.11.4.
slurmctld runs on the management node mmgt01.
srun and salloc fail intermittently on the login nodes: we can
successfully use srun from time to time, but then it stops working for a
while without any configuration change on our side.
The login node reports the following to the user:
$ srun -N 1 -n 64 --partition=gpunode --pty /bin/bash
srun: job 4872 queued and waiting for resources
srun: error: Security violation, slurm message from uid 202
srun: error: Security violation, slurm message from uid 202
srun: error: Security violation, slurm message from uid 202
srun: error: Task launch for StepId=4872.0 failed on node cn065: Invalid job credential
srun: error: Application launch failed: Invalid job credential
srun: Job step aborted
uid 202 is the slurm user.
On the server side, slurmctld logs show:
sched: _slurm_rpc_allocate_resources JobId=4872 NodeList=(null) usec=228
sched: Allocate JobId=4872 NodeList=cn065 #CPUs=64 Partition=gpunode
error: slurm_receive_msgs: [[snmgt01]:38727] failed: Zero Bytes were transmitted or received
Killing interactive JobId=4872: Communication connection failure
_job_complete: JobId=4872 WEXITSTATUS 1
_job_complete: JobId=4872 done
step_partial_comp: JobId=4872 StepID=0 invalid; this step may have already completed
_slurm_rpc_complete_job_allocation: JobId=4872 error Job/step already completing or completed
We suspect a network issue, given the "Zero Bytes were transmitted or
received" error.
Configless mode is working properly: slurmd on the login node picks up
changes made to slurm.conf after a scontrol reconfig.
srun runs successfully from the management nodes and from the compute
nodes; the issue only appears on the login nodes.
scontrol ping always shows DOWN from the login node, even when we can
successfully run srun or salloc:
$ scontrol ping
Slurmctld(primary) at mmgt01 is DOWN
We also checked for munge consistency across the nodes.
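For reference, the checks were along these lines (cn065 stands in for any
compute node here):

$ sudo sha256sum /etc/munge/munge.key   # same hash expected on every node
$ munge -n | ssh cn065 unmunge          # encode locally, decode remotely

These are the standard munge sanity tests; unmunge also reports the decoded
uid/gid and will flag clock skew between the hosts as an expired or rewound
credential.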
The mmgt and login nodes have each other's hostnames in /etc/hosts, and
they can communicate.
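For example, from the login node (assuming the default SlurmctldPort of
6817; adjust if slurm.conf sets a different port):

$ getent hosts mmgt01   # resolution goes through /etc/hosts
$ nc -zv mmgt01 6817    # TCP connect to slurmctld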
We would really appreciate some tips on what we could be missing.
Best regards,
Bruno Bruzzo
System Administrator - Clementina XXI