[slurm-users] draining nodes due to failed killing of task?
chris at csamuel.org
Sun Aug 8 18:42:12 UTC 2021
On 8/7/21 11:47 pm, Adrian Sevcenco wrote:
> yes, the running jobs have a file-saving phase which, depending on the
> target, can get stuck ...
> I have to think of a way to take a process snapshot when this happens ..
Slurm lets you request a signal a set amount of time before the job is
due to end. Your job could use that signal to trigger its checkpoint
ahead of the time limit, so you don't hit this problem.
Look at the --signal option in "man sbatch".
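
As a rough sketch of what that could look like (not tested against your
setup): the batch script below asks for SIGUSR1 about 300 seconds before
the time limit, with the B: prefix so only the batch shell receives it,
and traps it to write a checkpoint. The time limit, the 300 second lead,
the CHECKPOINT_FILE and MAX_STEPS names, and the loop standing in for the
real workload are all made up for illustration.

```shell
#!/bin/bash
#SBATCH --time=01:00:00
# Ask Slurm to send SIGUSR1 to the batch shell (B:) roughly 300 seconds
# before the time limit. Both values are assumptions; tune them to your job.
#SBATCH --signal=B:USR1@300

# Hypothetical checkpoint location and workload size, for illustration only.
CHECKPOINT_FILE="${CHECKPOINT_FILE:-checkpoint.state}"
MAX_STEPS="${MAX_STEPS:-3}"
step=0

save_checkpoint() {
    # Placeholder: a real job would write out its actual state here.
    echo "step=${step}" > "${CHECKPOINT_FILE}"
    echo "checkpoint written at step ${step}"
}

on_usr1() {
    # Runs when the early-warning signal arrives: save state, then stop.
    save_checkpoint
    exit 0
}
trap on_usr1 USR1

# Placeholder workload. Bash only runs a trap once the current foreground
# command has returned, so keep each step short (or run the real work in
# the background and "wait" on it).
while [ "${step}" -lt "${MAX_STEPS}" ]; do
    step=$((step + 1))
    sleep 1
done

# Normal completion: save the final state as well.
save_checkpoint
```

Slurm may deliver the signal up to about a minute earlier than requested,
so leave some margin on top of however long the save itself takes.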
Best of luck!
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA