[slurm-users] GraceTime is not working, But there is log.

김형진 jinsbiz13 at gmail.com
Thu Nov 9 01:33:01 UTC 2023


Thank you for your response. Your explanation made it clear what was
happening.

After writing and running a new test program that only logs on SIGTERM, I
could confirm that the GraceTime was applied.

Thank you once again.

Below is the sample code, for others' reference:

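$ cat run-gpu.cu
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <cuda_runtime.h>

// Log SIGTERM without exiting, so the job keeps running through the GraceTime.
// write() is used because printf() is not async-signal-safe.
void sigterm_handler(int signum) {
    (void)signum;
    const char msg[] = "Received SIGTERM, but not terminating\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}

// Trivial kernel: each thread stores its own global index.
__global__ void dummy_kernel(int *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = idx;
}

int main() {
    signal(SIGTERM, sigterm_handler);

    int *device_data;
    if (cudaMalloc((void **)&device_data, 1024 * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return EXIT_FAILURE;
    }

    // One block of 1024 threads, matching the 1024-element buffer.
    dummy_kernel<<<1, 1024>>>(device_data);
    cudaDeviceSynchronize();

    // Loop forever; SIGTERM only interrupts sleep() and gets logged.
    while (1) {
        sleep(1);
        printf("Working with GPU...\n");
        fflush(stdout);  // keep the log current when stdout goes to a file
    }

    // Not reached: the loop above never exits.
    cudaFree(device_data);
    return 0;
}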
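
The test program above deliberately ignores SIGTERM forever. A real job would more likely use the GraceTime to stop cleanly. Here is a minimal GPU-free sketch of that pattern in plain C, assuming POSIX sigaction() is available. The names sigterm_received, install_sigterm_handler and work_until_sigterm are made up for this illustration and are not part of Slurm or CUDA.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler. volatile sig_atomic_t is the type that is safe
 * to write from a signal handler and read from the main program. */
volatile sig_atomic_t sigterm_received = 0;

/* Handler: only record that SIGTERM arrived; do no other work here. */
void on_sigterm(int signum) {
    (void)signum;
    sigterm_received = 1;
}

/* Install the handler with sigaction(). Returns 0 on success, -1 on error. */
int install_sigterm_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* Hypothetical work loop: performs up to max_steps units of work and
 * stops early once SIGTERM has been seen. Returns the steps completed. */
int work_until_sigterm(int max_steps) {
    int steps = 0;
    while (!sigterm_received && steps < max_steps) {
        /* one unit of work (e.g. a kernel launch) would go here */
        steps++;
    }
    return steps;
}
```

A main() would call install_sigterm_handler() once at startup, run the work loop, and then do its cleanup (checkpointing, cudaFree() and so on) after the loop returns, before the GraceTime expires. The same handler could presumably be registered for a different signal if the job is submitted with --signal and send_user_signal is enabled, as described in the quoted reply.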



On Wed, Nov 8, 2023 at 5:02 PM, Rémi Palancher <remi at rackslab.io> wrote:

> On 08/11/2023 at 02:28, 김형진 wrote:
> > Hello ~
> >
> > …
> >
> > However, as soon as the base QoS job is created, the large QoS job is
> > immediately canceled without any waiting time.
> >
> > But in the slurmctld log, there is a grace time log.
> >
> > [2023-11-02T11:37:36.589] debug:  setting 3600 sec preemption grace time
> > for JobId=153 to reclaim resources for JobId=154
> >
> > Could you help me understand what might be going wrong?
>
> Note that by default Slurm sends SIGTERM to the immediate children of
> slurmstepd (which might be gpu_burn in your case) at _the beginning_ of
> the GraceTime, to notify them of the approaching termination.
>
> If the processes react to SIGTERM by terminating, which is generally the
> case, you may get the impression that GraceTime is not honored.
>
> To benefit from the GraceTime, your program must either trap SIGTERM
> with a signal handler, or you must enable the send_user_signal flag in
> PreemptParameters and submit your job with --signal and another
> signal.
>
> --
> Rémi Palancher
> Rackslab: Open Source Solutions for HPC Operations
> https://rackslab.io/
>
>
>