[slurm-users] All user's jobs killed at the same time on all nodes
paddy at tchpc.tcd.ie
Fri Jun 29 05:03:53 MDT 2018
On Fri, Jun 29, 2018 at 10:13:33AM +0000, Matteo Guglielmi wrote:
> Dear community,
> I have a user who usually submits 36 (identical) jobs at a time using a simple for loop,
> so the jobs are all sbatched at the same time.
> Each job requests a single core and all jobs are independent of one another (they read
> different input files and write to different output files).
> Jobs are then usually started over the next couple of hours, somewhat at random.
> What happens then is that after a certain amount of time (maybe from 2 to 12 hours)
> ALL jobs belonging to this particular user are killed by slurm on all nodes at exactly the
> same time.
> One example:
> ### master: /var/log/slurmctld.log ###
> [2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560 InitPrio=4294185624 usec=255
> [2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on node38
> [2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 718560 uid 1007
> [2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560 State=0x8004 NodeCnt=1 successful 0x8004
That line suggests the user killed their own job (presuming that uid 1007 is them;
otherwise it was an operator with permission to kill jobs).
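If you want to confirm which account uid 1007 maps to, a quick check on the
controller would be something like this (assuming the usual passwd/LDAP name
service lookup):

  # map the uid from the log line back to a username
  getent passwd 1007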
Have a look in the slurmctld.log for more lines with 'REQUEST_KILL_JOB'; if
they all appear at basically the same time, then it looks like uid 1007 did
something like 'scancel -u theusername'.
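For example, a rough way to pull those lines out and see whether the kill
requests are clustered is something like the following (a sketch only; it
assumes the bracketed timestamp format shown in your log excerpt above):

  # list every kill request with its timestamp, job id and requesting uid
  grep 'REQUEST_KILL_JOB' /var/log/slurmctld.log

  # count kill requests per second; a burst at a single timestamp would
  # point at a bulk cancel such as 'scancel -u theusername'
  grep 'REQUEST_KILL_JOB' /var/log/slurmctld.log \
    | awk '{print substr($1,2,19)}' | sort | uniq -c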
That might not be it, but that would be my first guess.
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.