[slurm-users] GraceTime is not working, But there is log.
Rémi Palancher
remi at rackslab.io
Wed Nov 8 07:59:01 UTC 2023
On 08/11/2023 at 02:28, 김형진 wrote:
> Hello ~
>
> …
>
> However, as soon as the base QoS job is created, the large QoS job is
> immediately canceled without any waiting time.
>
> But in the slurmctld log, there is a grace time log.
>
> [2023-11-02T11:37:36.589] debug: setting 3600 sec preemption grace time
> for JobId=153 to reclaim resources for JobId=154
>
> Could you help me understand what might be going wrong?

Note that by default Slurm sends SIGTERM to the immediate children of
slurmstepd (which might be gpu_burn in your case) at _the beginning_ of
the GraceTime, to notify them of the approaching termination.

If the processes react to SIGTERM by terminating, which is generally the
case, you may get the impression that GraceTime is not honored.

To benefit from the GraceTime, either your program must trap SIGTERM
with a signal handler, or you must enable the send_user_signal flag in
PreemptParameters and submit your job with --signal and another signal.
--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/