[slurm-users] srun fails with "srun: error: Security violation, slurm message from uid" if delay in job starting
Mark Dixon
mark.c.dixon at durham.ac.uk
Mon Dec 13 13:41:26 UTC 2021
Hi all,
Just wondering if anyone else has seen this.
Running Slurm 21.08.2, we're seeing srun work normally if the job is able to
start immediately. However, if there is a delay before the job starts (e.g.
while it waits for another job to end), srun fails:
[test@foo ~]$ srun -p test --pty bash
[test@bar ~]$ exit
exit
[test@foo ~]$
[test@foo ~]$ sbatch -p test --exclusive sleep.sh
Submitted batch job 3407
[test@foo ~]$ srun -p test --pty bash
srun: job 3409 queued and waiting for resources
srun: error: Security violation, slurm message from uid 456
srun: error: Security violation, slurm message from uid 456
srun: error: Job allocation 3409 has been revoked
[test@foo ~]$
With --slurmd-debug=verbose, I see:
srun: job 3390 queued and waiting for resources
srun: error: Security violation, slurm message from uid 456
srun: error: Security violation, slurm message from uid 456
srun: error: Job allocation 3390 has been revoked
Meanwhile, the slurmd log shows:
[2021-12-13T13:08:06.028] Job 3390 already killed, do not launch extern step
Any ideas, please?
Thanks!
Mark