[slurm-users] draining nodes due to failed killing of task?

Christopher Samuel chris at csamuel.org
Sun Aug 8 18:42:12 UTC 2021


On 8/7/21 11:47 pm, Adrian Sevcenco wrote:

> Yes, the running jobs have a file-saving step that runs if they are
> killed, and depending on the target that save can get stuck ...
> I have to think of a way to take a process snapshot when this happens ..

Slurm lets you request that a signal be sent a set amount of time before
the job is due to end. You could have your job catch that signal and do
the checkpoint in advance of the end of the job, so you don't hit this
problem.

Look at the --signal option in "man sbatch".
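
As a rough sketch of what that could look like (not taken from the
original job; the time limit, the 300 second lead time, the choice of
USR1, and the save_checkpoint() placeholder are all assumptions made for
illustration), here is a batch script written in Python that asks for
SIGUSR1 ahead of the time limit and checkpoints when it arrives. The
"B:" prefix asks Slurm to signal only the batch shell, which here is the
Python script itself. Per the man page, the signal may arrive up to 60
seconds earlier than requested.

```python
#!/usr/bin/env python3
#SBATCH --time=01:00:00
#SBATCH --signal=B:USR1@300
"""Checkpoint-on-signal sketch; values above are illustrative only."""
import signal
import time

# Set by the signal handler, polled by the work loop.
checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Signal handler: only record the request, do the real work later."""
    global checkpoint_requested
    checkpoint_requested = True


def save_checkpoint(step):
    """Hypothetical placeholder: replace with the job's real save logic."""
    print(f"checkpoint written at step {step}", flush=True)


def run(total_steps, step_seconds=1.0):
    """Work loop that checkpoints and stops once the signal has arrived.

    Returns the step at which it checkpointed, or total_steps if the
    loop finished without being signalled.
    """
    signal.signal(signal.SIGUSR1, request_checkpoint)
    for step in range(total_steps):
        time.sleep(step_seconds)  # stand-in for one unit of real work
        if checkpoint_requested:
            save_checkpoint(step)
            return step
    return total_steps


if __name__ == "__main__":
    # Small demo values; a real job would loop over its actual workload.
    run(total_steps=5, step_seconds=1.0)
```

If the real program is instead started with srun from a shell batch
script, dropping the "B:" prefix makes Slurm signal the job steps rather
than the batch shell. The lead time should comfortably exceed how long a
save can take, since the whole point is to finish before the job is
killed.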

Best of luck!
Chris
-- 
   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


