[slurm-users] Users can't scancel

mercan ahmet.mercan at uhem.itu.edu.tr
Wed Nov 18 17:41:17 UTC 2020


Hi;

Check the epilog's return value, which is the return value of the last 
command the epilog script runs. For testing purposes, you can also add 
an "exit 0" line as the last line of the epilog script to make sure it 
returns zero.
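
As a rough illustration of why the last line matters, here is a minimal 
sketch. It does not use a real epilog: the two script bodies are 
hypothetical stand-ins, with "false" playing the role of a final command 
that fails. It only shows that a script's exit status is that of its last 
command unless an explicit "exit 0" follows it.

```shell
#!/bin/bash
# Hypothetical stand-in for an epilog whose last command fails.
failing_epilog='false'

# The same body with "exit 0" appended as the final line.
patched_epilog='false
exit 0'

# Run each body in a child shell and capture its exit status.
bash -c "$failing_epilog"
rc_failing=$?

bash -c "$patched_epilog"
rc_patched=$?

echo "without exit 0: $rc_failing"
echo "with exit 0:    $rc_patched"
```

The first status should be nonzero and the second zero. If the real epilog 
behaves like the first case, the appended "exit 0" is a quick way to 
confirm that its return value is what is causing trouble.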

Ahmet M.


On 18.11.2020 20:00, William Markuske wrote:
>
> Hello,
>
> I am having an odd problem where users are unable to kill their jobs 
> with scancel. Users can submit jobs just fine, and when a task 
> completes on its own the job closes correctly. However, if a user 
> attempts to cancel a job via scancel, SIGKILL signals are sent to the 
> step but the processes do not die. Slurmd then keeps sending SIGKILL 
> until the UnkillableStepTimeout is hit, the Slurm job exits with an 
> error, the node enters a draining state, and the spawned processes 
> continue to run on the node.
>
> I'm at a loss because jobs can complete without issue, which seems to 
> suggest it's not a networking or permissions issue preventing Slurm 
> from doing its job accounting tasks. A user can ssh to the node once a 
> job is submitted and kill the subprocesses manually, at which point 
> Slurm completes the epilog and the node returns to idle.
>
> Does anyone know what may be causing such behavior? Please let me know 
> any slurm.conf or cgroup.conf settings that would be helpful to 
> diagnose this issue. I'm quite stumped by this one.
>
> -- 
>
> Willy Markuske
>
> HPC Systems Engineer
>
> Research Data Services
>
> P: (858) 246-5593
>


