[slurm-users] Preemption: Not receiving signal

nico.faerber at id.unibe.ch nico.faerber at id.unibe.ch
Fri Oct 19 15:31:09 MDT 2018


Hi all,

According to the Slurm documentation, the SIGCONT and SIGTERM signals are sent twice to a job that is selected for preemption:

“Once a job has been selected for preemption, its end time is set to the current time plus GraceTime. The job is immediately sent SIGCONT and SIGTERM signals in order to provide notification of its imminent termination. This is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence upon reaching its new end time.”

I can trap the first SIGTERM in a job submitted with srun, and in a job step launched with srun from inside a batch script submitted with sbatch. I cannot trap it in the batch script itself when it is submitted with sbatch: the batch shell only receives a SIGTERM after GraceTime has expired. Why is the first SIGTERM not sent to the batch shell? I use the following test job:

#!/bin/bash

housekeeping() {
        echo "$(date): Cleaning up..." >> job.log
        sleep 10
        echo "$(date): Done." >> job.log
        exit 1
}

trap 'housekeeping' TERM

echo "$(date): Starting batch job." >> job.log

while true; do
        sleep 2 &
        wait $!
done

exit 0
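
To rule out the trap logic itself, the same pattern can be exercised locally without Slurm. The following is only a sketch: the subshell stands in for the batch shell, the temporary log file and the variable names (log, job_pid, status) are illustrative, and the SIGTERM is sent by hand rather than by slurmd.

```shell
#!/bin/bash
# Local check of the trap pattern from the test job; no Slurm involved.
log=$(mktemp)

(
    # Same idea as housekeeping(): log, then exit with status 1.
    trap 'echo "Cleaning up..." >> "$log"; exit 1' TERM
    echo "Starting." >> "$log"
    while true; do
        # Run sleep in the background so that the wait builtin can be
        # interrupted by the signal and the trap runs right away.
        sleep 2 &
        wait $!
    done
) &
job_pid=$!

sleep 1                   # give the subshell time to install its trap
kill -TERM "$job_pid"     # stands in for the SIGTERM sent on preemption
wait "$job_pid"
status=$?

echo "exit status: $status"
cat "$log"
```

If the handler runs here but not under sbatch, the difference lies in which processes the first SIGTERM is delivered to, not in the script.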

Example: Submitting the test job with sbatch:

SubmitTime=2018-10-18T15:01:52 EligibleTime=2018-10-18T15:01:52
StartTime=2018-10-18T15:01:54 EndTime=2018-10-18T15:03:13 Deadline=N/A
PreemptTime=2018-10-18T15:02:13 SuspendTime=None SecsPreSuspend=0

job.log:
Thu Oct 18 15:01:54 CEST 2018: Starting batch job.
Thu Oct 18 15:03:24 CEST 2018: Cleaning up...
Thu Oct 18 15:03:34 CEST 2018: Done.

Example: Submitting the test job with srun:

SubmitTime=2018-10-18T15:08:52 EligibleTime=2018-10-18T15:08:52
StartTime=2018-10-18T15:08:52 EndTime=2018-10-18T15:09:50 Deadline=N/A
PreemptTime=2018-10-18T15:09:40 SuspendTime=None SecsPreSuspend=0

job.log:
Thu Oct 18 15:08:52 CEST 2018: Starting batch job.
Thu Oct 18 15:09:40 CEST 2018: Cleaning up...
Thu Oct 18 15:09:50 CEST 2018: Done.

Slurm version 17.02.10

slurm.conf:
(…)
PreemptType=preempt/qos
PreemptMode=CANCEL
PartitionName=low-prio Nodes=node[01-09] DefaultTime=01:00:00 MaxTime=24:00:00 DefMemPerCPU=2020 GraceTime=60 State=UP QOS=part_gpu
(…)

Is this the intended behavior or am I missing something? It seems that the only way to perform housekeeping from inside a batch script is to use the --signal option, e.g. --signal=B:TERM@60, or the extra time provided by KillWait. Can anybody confirm?
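
A minimal sketch of the --signal variant, for reference. It assumes that the B: prefix restricts the signal to the batch shell, as described in the sbatch man page, and that the 60 seconds are counted back from the job's end time. Whether this fires early enough when the end time is moved by preemption is exactly the open question above. The LOG variable and the bounded three-iteration loop are placeholders for a real workload.

```shell
#!/bin/bash
#SBATCH --signal=B:TERM@60
# The directive above asks Slurm to send SIGTERM to the batch shell only,
# roughly 60 seconds before the job's end time.

LOG="${LOG:-job.log}"

housekeeping() {
        echo "$(date): Cleaning up..." >> "$LOG"
        # Real cleanup (copying results, removing scratch files) goes here.
        echo "$(date): Done." >> "$LOG"
        exit 1
}

trap 'housekeeping' TERM

echo "$(date): Starting batch job." >> "$LOG"

# Placeholder workload: three short iterations instead of the endless loop.
# Each step runs in the background so that wait returns as soon as the
# signal arrives and the trap is executed without delay.
for i in 1 2 3; do
        sleep 1 &
        wait $!
done
```

Submitted with sbatch to the low-prio partition, this should show in job.log whether the handler starts before or after GraceTime has expired.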

Thank you!

---
Universität Bern
Informatikdienste
Gruppe Systemdienste

Nico Färber
Systemadministrator HPC

Hochschulstrasse 6
CH-3012 Bern
Tel. +41 (0)31 631 51 89

mailto: grid-support at id.unibe.ch
http://www.id.unibe.ch/

