[slurm-users] What can cause a job to get killed?

Andy Riebs andy.riebs at hpe.com
Tue Apr 17 10:39:56 MDT 2018


I had a job running last night, with a 30-minute timeout. (It's a 
well-tested script that runs multiple times daily.)

On one run, in the middle of a set of runs for this job, I got this on the 
console after about 8 minutes:

srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 617845.0 ON node01 CANCELLED AT 2018-04-17T00:36:58 ***
srun: error: node14 : tasks 3680,3682-3683: Killed
srun: Terminating job step 617845.0

Slurmctld duly reports that the job terminated with "WTERMSIG 9", and 
the slurmd logs also indicate "task XXXX (YYYYY) exited. Killed by 
signal 9."
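
For reference, here is a minimal sketch of what "WTERMSIG 9" / "Killed by 
signal 9" means at the process level. It is plain Python on a POSIX system, 
not anything from Slurm; the helper names describe_exit and run_and_kill are 
made up for illustration. Signal 9 is SIGKILL, which a process cannot catch 
or handle, so the killed task gets no chance to log anything itself. The 
sketch shows how to decode the status but says nothing about who sent the 
signal.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Describe a subprocess return code.

    subprocess reports "terminated by signal N" as the negative value -N.
    """
    if returncode < 0:
        sig = signal.Signals(-returncode)
        return f"killed by signal {sig.value} ({sig.name})"
    return f"exited with status {returncode}"


def run_and_kill() -> str:
    """Start a sleeping child, send it SIGKILL, and describe how it ended."""
    # Hypothetical stand-in for a job task: a child that would sleep 30 s.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    child.send_signal(signal.SIGKILL)  # cannot be caught by the child
    child.wait()
    return describe_exit(child.returncode)


if __name__ == "__main__":
    print(run_and_kill())
```

On Linux this should print "killed by signal 9 (SIGKILL)", the same 
information the slurmd log line carries.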

Any thoughts about why a job would get cancelled without getting any 
more detail than this?

Andy

-- 
Andy Riebs
andy.riebs at hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
     May the source be with you!



