[slurm-users] GraceTime is not working, But there is log.

Rémi Palancher remi at rackslab.io
Wed Nov 8 07:59:01 UTC 2023


On 08/11/2023 at 02:28, 김형진 wrote:
> Hello ~
>
> However, as soon as the base QoS job is created, the large QoS job is
> immediately canceled without any waiting time.
>
> But in the slurmctld log, there is a grace time log.
>
> [2023-11-02T11:37:36.589] debug:  setting 3600 sec preemption grace time
> for JobId=153 to reclaim resources for JobId=154
>
> Could you help me understand what might be going wrong?

Note that, by default, Slurm sends SIGTERM to the immediate children of
slurmstepd (which might be gpu_burn in your case) at _the beginning_ of
the GraceTime, to notify them of the approaching termination.

If the processes react to SIGTERM by terminating, which is generally the
case, you may have the impression that GraceTime is not honored.

To benefit from the GraceTime, either your program must trap SIGTERM
with a signal handler, or you must enable the send_user_signal flag in
PreemptParameters and submit your job with --signal and another signal.

-- 
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/
