[slurm-users] srun fails with "srun: error: Security violation, slurm message from uid" if delay in job starting
Mark Dixon
mark.c.dixon at durham.ac.uk
Tue Dec 14 08:49:27 UTC 2021
Hi all,
Sorry for the noise, this was down to a problem with our configless setup.
Really must start running slurmd everywhere and get rid of the
compute-only version of slurm.conf...
Cheers,
Mark
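
The message above does not say what exactly was wrong with the configless setup. If the concern is that a separately maintained compute-only slurm.conf has drifted from the main one, a minimal sketch for comparing two copies might look like the following. The helper names conf_digest and conf_same are hypothetical and not part of Slurm, and the sketch assumes that every '#' starts a comment and that sha256sum is available.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print a SHA-256 digest of a
# slurm.conf-style file after dropping comments, blank lines and
# leading/trailing whitespace, so cosmetic differences are ignored.
# Assumption: every '#' starts a comment; line order is kept as-is.
conf_digest() {
    sed -e 's/#.*$//' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d' "$1" | sha256sum | cut -d' ' -f1
}

# Hypothetical helper: report whether two copies carry the same settings.
conf_same() {
    if [ "$(conf_digest "$1")" = "$(conf_digest "$2")" ]; then
        echo "same"
    else
        echo "different"
    fi
}
```

It could be called as conf_same MAIN_CONF COMPUTE_CONF, where both arguments are placeholders for wherever the two copies live on a given site. A result of "different" only shows that the files disagree; it does not by itself explain the srun error quoted below.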
On Mon, 13 Dec 2021, Mark Dixon wrote:
> Hi all,
>
> Just wondering if anyone else had seen this.
>
> Running Slurm 21.08.2, we're seeing srun work normally if the job can
> start immediately. However, if the job start is delayed, for example
> while waiting for another job to end, srun fails:
>
> [test@foo ~]$ srun -p test --pty bash
> [test@bar ~]$ exit
> exit
> [test@foo ~]$
>
> [test@foo ~]$ sbatch -p test --exclusive sleep.sh
> Submitted batch job 3407
> [test@foo ~]$ srun -p test --pty bash
> srun: job 3409 queued and waiting for resources
> srun: error: Security violation, slurm message from uid 456
> srun: error: Security violation, slurm message from uid 456
> srun: error: Job allocation 3409 has been revoked
> [test@foo ~]$
>
> With --slurmd-debug=verbose, I see:
>
> srun: job 3390 queued and waiting for resources
> srun: error: Security violation, slurm message from uid 456
> srun: error: Security violation, slurm message from uid 456
> srun: error: Job allocation 3390 has been revoked
>
> Meanwhile, the slurmd log shows:
>
> [2021-12-13T13:08:06.028] Job 3390 already killed, do not launch extern step
>
>
> Any ideas, please?
>
> Thanks!
>
> Mark
>
>
>