<div dir="ltr"><p style="border:0px solid rgb(217,217,227);box-sizing:border-box;margin:0px 0px 1.25em;color:rgb(55,65,81);font-family:Söhne,ui-sans-serif,system-ui,-apple-system,"Segoe UI",Roboto,Ubuntu,Cantarell,"Noto Sans",sans-serif,"Helvetica Neue",Arial,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol","Noto Color Emoji";font-size:16px;background-color:rgb(247,247,248)">Thank you for your response.
Thanks to your explanation, I now understand what was happening.</p><p style="border:0px solid rgb(217,217,227);box-sizing:border-box;margin:1.25em 0px;color:rgb(55,65,81);font-family:Söhne,ui-sans-serif,system-ui,-apple-system,"Segoe UI",Roboto,Ubuntu,Cantarell,"Noto Sans",sans-serif,"Helvetica Neue",Arial,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol","Noto Color Emoji";font-size:16px;background-color:rgb(247,247,248)">After writing and running a new test program that only logs a message on SIGTERM instead of exiting, I was able to confirm that the GraceTime is applied.</p><p style="border:0px solid rgb(217,217,227);box-sizing:border-box;margin:1.25em 0px;color:rgb(55,65,81);font-family:Söhne,ui-sans-serif,system-ui,-apple-system,"Segoe UI",Roboto,Ubuntu,Cantarell,"Noto Sans",sans-serif,"Helvetica Neue",Arial,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol","Noto Color Emoji";font-size:16px;background-color:rgb(247,247,248)">Thank you once again.</p><p style="border:0px solid rgb(217,217,227);box-sizing:border-box;margin:1.25em 0px;color:rgb(55,65,81);font-family:Söhne,ui-sans-serif,system-ui,-apple-system,"Segoe UI",Roboto,Ubuntu,Cantarell,"Noto Sans",sans-serif,"Helvetica Neue",Arial,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol","Noto Color Emoji";font-size:16px;background-color:rgb(247,247,248)">Below is the sample code, for others' reference:</p><p style="border:0px solid rgb(217,217,227);box-sizing:border-box;margin:1.25em 0px;color:rgb(55,65,81);font-family:Söhne,ui-sans-serif,system-ui,-apple-system,"Segoe UI",Roboto,Ubuntu,Cantarell,"Noto Sans",sans-serif,"Helvetica Neue",Arial,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol","Noto Color Emoji";font-size:16px;background-color:rgb(247,247,248)">$ cat run-gpu.cu<br>#include &lt;stdio.h&gt;<br>#include &lt;stdlib.h&gt;<br>#include &lt;signal.h&gt;<br>#include &lt;unistd.h&gt;<br>#include &lt;cuda_runtime.h&gt;<br><br><br>void sigterm_handler(int signum) {<br> /* write() is async-signal-safe; printf() is not */<br> const char msg[] = "Received SIGTERM, but not terminating\n";<br> write(STDOUT_FILENO, msg, sizeof(msg) - 1);<br> 
<br>}<br><br><br>__global__ void dummy_kernel(int *data) {<br> int idx = blockIdx.x * blockDim.x + threadIdx.x;<br> data[idx] = idx;<br>}<br><br>int main() {<br> /* Log SIGTERM instead of exiting on it */<br> signal(SIGTERM, sigterm_handler);<br><br> int *device_data;<br> cudaMalloc((void **)&amp;device_data, 1024 * sizeof(int));<br><br> <br> dummy_kernel&lt;&lt;&lt;1, 1024&gt;&gt;&gt;(device_data);<br> cudaDeviceSynchronize();<br><br> /* Keep the process alive so the grace period can be observed */<br> while(1) {<br> sleep(1);<br> printf("Working with GPU...\n");<br> }<br><br> /* Not reached: the loop above never exits */<br> cudaFree(device_data);<br> return 0;<br>}<br></p><p style="border:0px solid rgb(217,217,227);box-sizing:border-box;margin:1.25em 0px;color:rgb(55,65,81);font-family:Söhne,ui-sans-serif,system-ui,-apple-system,"Segoe UI",Roboto,Ubuntu,Cantarell,"Noto Sans",sans-serif,"Helvetica Neue",Arial,"Apple Color Emoji","Segoe UI Emoji","Segoe UI Symbol","Noto Color Emoji";font-size:16px;background-color:rgb(247,247,248)"><br></p></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Nov 8, 2023 at 5:02 PM, Rémi Palancher &lt;<a href="mailto:remi@rackslab.io">remi@rackslab.io</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 08/11/2023 at 02:28, 김형진 wrote:<br>
> Hello ~<br>
> <br>
> …<br>
> <br>
> However, as soon as the base QoS job is created, the large QoS job is <br>
> immediately canceled without any waiting time.<br>
> <br>
> But in the slurmctld log, there is a grace time log.<br>
> <br>
> [2023-11-02T11:37:36.589] debug: setting 3600 sec preemption grace time <br>
> for JobId=153 to reclaim resources for JobId=154<br>
> <br>
> Could you help me understand what might be going wrong?<br>
<br>
Note that, by default, Slurm sends SIGTERM to the immediate children of <br>
slurmstepd (which might be gpu_burn in your case) at _the beginning_ of <br>
the GraceTime, to notify them of the approaching termination.<br>
<br>
If the processes react to SIGTERM by terminating, which is generally the <br>
case, you may have the impression that GraceTime is not honored.<br>
<br>
To benefit from the GraceTime, your program must either trap SIGTERM <br>
with a signal handler, or you must enable the send_user_signal flag in <br>
PreemptParameters and submit your job with --signal set to another signal.<br>
<br>
-- <br>
Rémi Palancher<br>
Rackslab: Open Source Solutions for HPC Operations<br>
<a href="https://rackslab.io/" rel="noreferrer" target="_blank">https://rackslab.io/</a><br>
<br>
<br>
</blockquote></div>
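<div dir="ltr"><p>For anyone who wants a job to use the grace period to shut down cleanly, rather than just ignore SIGTERM as the test program above does, here is a minimal sketch in plain C (no CUDA, so it builds with any C compiler). The names <code>on_sigterm</code>, <code>install_sigterm_handler</code> and <code>run_until_terminated</code> are hypothetical and only for illustration. The handler does nothing but set a flag, and the work loop checks that flag between steps. The same pattern should apply to whichever signal the job is configured to receive.</p></div>

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler; volatile sig_atomic_t can be written safely
 * from inside a signal handler. */
volatile sig_atomic_t termination_requested = 0;

/* Handler: only records that SIGTERM arrived, no I/O here. */
void on_sigterm(int signum) {
    (void)signum;
    termination_requested = 1;
}

/* Installs the handler with sigaction(); returns 0 on success, -1 on error. */
int install_sigterm_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* Hypothetical work loop: runs up to max_steps one-second steps and
 * stops early once SIGTERM has been seen. Returns the steps completed. */
int run_until_terminated(int max_steps) {
    int steps = 0;
    while (steps < max_steps && !termination_requested) {
        sleep(1);  /* placeholder for one unit of real work */
        steps++;
    }
    if (termination_requested) {
        /* Assumption: this is where a real job would checkpoint or
         * release resources before the grace period runs out. */
        fputs("SIGTERM received, cleaning up before exit\n", stderr);
    }
    return steps;
}
```

<div dir="ltr"><p>A program would call <code>install_sigterm_handler()</code> once at startup and then <code>run_until_terminated()</code> from <code>main()</code>. Keeping the handler down to a single flag assignment avoids calling functions that are not async-signal-safe from signal context.</p></div>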