[slurm-users] GraceTime is not working, But there is log.
김형진
jinsbiz13 at gmail.com
Thu Nov 9 01:33:01 UTC 2023
Thank you for your response. Your explanation helped me understand what
was happening.
After writing and running a new test program that only logs on SIGTERM
instead of exiting, I could confirm that the GraceTime was applied.
Thank you once again.
Below is the sample code, for others' reference:
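$ cat run-gpu.cu
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <cuda_runtime.h>

/* Log SIGTERM and keep running, so the job lives on during the GraceTime. */
void sigterm_handler(int signum) {
    (void)signum;
    /* write() is async-signal-safe; printf() is not. */
    const char msg[] = "Received SIGTERM, but not terminating\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}

__global__ void dummy_kernel(int *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = idx;
}

int main() {
    signal(SIGTERM, sigterm_handler);

    /* Touch the GPU once so the job really uses the device. */
    int *device_data;
    cudaMalloc((void **)&device_data, 1024 * sizeof(int));
    dummy_kernel<<<1, 1024>>>(device_data);
    cudaDeviceSynchronize();

    /* Loop forever; SIGTERM is only logged, so it does not end the job. */
    while (1) {
        sleep(1);
        printf("Working with GPU...\n");
        fflush(stdout);  /* make the output visible when redirected to a file */
    }

    /* Not reached: the loop above never exits. */
    cudaFree(device_data);
    return 0;
}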
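
The test program above deliberately ignores SIGTERM. A real application would more likely use the GraceTime to shut down cleanly. One common pattern is for the handler to only set a flag, which the main loop then checks. Below is a minimal sketch of that pattern in plain C (it can be compiled as host code in a .cu file as well). The helper names install_sigterm_handler() and termination_requested() are hypothetical, chosen for illustration; they are not part of Slurm or CUDA.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for sigaction() under strict -std= modes */
#endif
#include <signal.h>
#include <string.h>

/* Set by the handler; volatile sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t got_sigterm = 0;

static void on_sigterm(int signum) {
    (void)signum;
    got_sigterm = 1;  /* only record the event; do the real work in main */
}

/* Hypothetical helper: install the handler. Returns 0 on success, -1 on error. */
int install_sigterm_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* Hypothetical helper: returns 1 once SIGTERM has been received, else 0. */
int termination_requested(void) {
    return got_sigterm != 0;
}
```

With these helpers, the loop in run-gpu.cu could become while (!termination_requested()) { ... }, so that cudaFree() and any checkpointing run after the loop and before the GraceTime expires.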
On Wed, Nov 8, 2023 at 5:02 PM, Rémi Palancher <remi at rackslab.io> wrote:
> On 08/11/2023 at 02:28, 김형진 wrote:
> > Hello ~
> >
> > …
> >
> > However, as soon as the base QoS job is created, the large QoS job is
> > immediately canceled without any waiting time.
> >
> > But in the slurmctld log, there is a grace time log.
> >
> > [2023-11-02T11:37:36.589] debug: setting 3600 sec preemption grace time
> > for JobId=153 to reclaim resources for JobId=154
> >
> > Could you help me understand what might be going wrong?
>
> Note that by default Slurm sends the SIGTERM signal to the immediate
> children of slurmstepd (which might be gpu_burn in your case) at _the
> beginning_ of the GraceTime, to notify them of approaching termination.
>
> If the processes react to SIGTERM by terminating, which is generally the
> case, you may have the impression that GraceTime is not honored.
>
> To benefit from the GraceTime, your program must either trap SIGTERM
> with a signal handler, or you must enable the send_user_signal flag in
> PreemptParameters and submit your job with --signal and another
> signal.
>
> --
> Rémi Palancher
> Rackslab: Open Source Solutions for HPC Operations
> https://rackslab.io/