[slurm-users] Users can't scancel
mercan
ahmet.mercan at uhem.itu.edu.tr
Wed Nov 18 17:41:17 UTC 2020
Hi;
Check the epilog's return value, which is the exit status of the last
command run in the epilog script. For testing purposes, you can also add
an "exit 0" line as the last line of the epilog script to make sure it
returns zero.
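
As a rough illustration of that point (a sketch only, not a real Slurm
epilog: the file names epilog_fail.sh and epilog_ok.sh and the "false"
stand-in command are made up for the demo), the following shows that a
script's exit status follows its last command, and that a trailing
"exit 0" overrides it:

```shell
#!/bin/bash
# Hypothetical demo, not a real Slurm epilog: a script's exit status is
# that of its last command unless an explicit "exit" overrides it.

tmpdir=$(mktemp -d)

# Script whose last command fails, so the script returns non-zero.
cat > "$tmpdir/epilog_fail.sh" <<'EOF'
#!/bin/bash
false   # stand-in for a cleanup command that fails
EOF

# Same script with "exit 0" appended as the last line.
cat > "$tmpdir/epilog_ok.sh" <<'EOF'
#!/bin/bash
false   # stand-in for a cleanup command that fails
exit 0
EOF

# Capture each exit status without aborting if the script fails.
rc_fail=0; bash "$tmpdir/epilog_fail.sh" || rc_fail=$?
rc_ok=0;   bash "$tmpdir/epilog_ok.sh"   || rc_ok=$?

echo "without exit 0: $rc_fail"
echo "with exit 0: $rc_ok"

rm -rf "$tmpdir"
```

Note that "exit 0" only hides the failure; it is useful for checking
whether the epilog's return value is involved, not as a permanent fix.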
Ahmet M.
On 18.11.2020 at 20:00, William Markuske wrote:
>
> Hello,
>
> I am having an odd problem where users are unable to kill their jobs
> with scancel. Users can submit jobs just fine, and when a task
> completes it closes correctly. However, if a user attempts
> to cancel a job via scancel, the SIGKILL signals are sent to the step
> but don't complete. Slurmd then continues to send SIGKILL requests
> until the UnkillableStepTimeout is hit, the Slurm job exits with an
> error, the node enters a draining state, and the spawned processes
> continue to run on the node.
>
> I'm at a loss because jobs can complete without issue, which seems to
> suggest it's not a networking or permissions issue preventing Slurm from
> doing job accounting tasks. A user can ssh to the node once a job is
> submitted and kill the subprocesses manually, at which point Slurm
> completes the epilog and the node returns to idle.
>
> Does anyone know what may be causing such behavior? Please let me know
> any slurm.conf or cgroup.conf settings that would be helpful to
> diagnose this issue. I'm quite stumped by this one.
>
> --
>
> Willy Markuske
>
> HPC Systems Engineer
>
>
>
> Research Data Services
>
> P: (858) 246-5593
>