[slurm-users] All user's jobs killed at the same time on all nodes

Paddy Doyle paddy at tchpc.tcd.ie
Fri Jun 29 05:03:53 MDT 2018


Hi Matteo,

On Fri, Jun 29, 2018 at 10:13:33AM +0000, Matteo Guglielmi wrote:

> Dear community,
> 
> I have a user who usually submits 36 (identical) jobs at a time using a simple for loop,
> so the jobs are all sbatch'd at the same time.
> 
> Each job requests a single core and all jobs are independent of one another (they read
> different input files and write to different output files).
> 
> The jobs are then usually started over the next couple of hours, at somewhat random
> times.
> 
> What happens then is that after a certain amount of time (anywhere from 2 to 12 hours)
> ALL jobs belonging to this particular user are killed by Slurm on all nodes at exactly the
> same time.
> 
> One example:
> 
> ### master: /var/log/slurmctld.log ###
> 
> [2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560 InitPrio=4294185624 usec=255
> ...
> [2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on node38
> ...
> [2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 718560 uid 1007
> [2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560 State=0x8004 NodeCnt=1 successful 0x8004

That line suggests the user killed their own job (presuming that uid 1007 is
theirs; otherwise it was an operator with permission to kill jobs).
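
If you want to confirm which account uid 1007 maps to, a standard Linux lookup
(nothing Slurm-specific) should do it, for example:

    getent passwd 1007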

Have a look in the slurmctld.log for more lines with 'REQUEST_KILL_JOB'; if
they all appear at basically the same time, then it looks like uid 1007 did
something like 'scancel -u theusername'.
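
A quick way to check, assuming the log path shown in your excerpt (adjust it to
match your slurm.conf):

    grep 'REQUEST_KILL_JOB' /var/log/slurmctld.log

If all of the user's jobs show a kill request with the same timestamp and the
same uid, a bulk cancel such as 'scancel -u theusername' is the most likely
explanation.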

That might not be it, but that would be my first guess.

Paddy

-- 
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/


