[slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

Tue Dec 1 17:23:40 UTC 2020

Hello Robert,

I've been having the same issue on BCM 9.0 with CentOS 8.2 and Slurm 
20.02.3. It seems to have started when I enabled proctrack/cgroup and 
changed select/linear to select/cons_tres.

Are you using cgroup process tracking and have you manipulated the 
cgroup.conf file? Do jobs complete correctly when not cancelled?
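
In case it helps compare setups, here is a rough sketch of how the settings 
mentioned in this thread could be pulled out of the running config. The 
function name slurm_kill_settings is made up for illustration; it only 
filters the "Key = Value" lines that scontrol show config prints, and 
assumes the output format matches what was posted earlier in the thread.

```shell
# Filter Slurm config output down to the settings discussed in this thread.
# Reads "Key = Value" lines on stdin, in the format `scontrol show config` prints.
# slurm_kill_settings is a hypothetical helper name, not part of Slurm.
slurm_kill_settings() {
    grep -Ei '^(ProctrackType|SelectType|UnkillableStepProgram|UnkillableStepTimeout)[[:space:]]*='
}

# On a cluster node this would be used as:
#   scontrol show config | slurm_kill_settings
```

Comparing that output between a node that drains and one that does not 
might show whether the process tracking or select plugin differs.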

Regards,

Willy Markuske

HPC Systems Engineer

Research Data Services

P: (858) 246-5593

On 11/30/20 10:54 AM, Alex Chekholko wrote:
> This may be more "cargo cult" but I've advised users to add a "sleep 
> 60" to the end of their job scripts if they are "I/O intensive".  
> Sometimes they are somehow able to generate I/O in a way that slurm 
> thinks the job is finished, but the OS is still catching up on the 
> I/O, and then slurm tries to kill the job...
>
> On Mon, Nov 30, 2020 at 10:49 AM Robert Kudyba <rkudyba at fordham.edu 
> <mailto:rkudyba at fordham.edu>> wrote:
>
>     Sure I've seen that in some of the posts here, e.g., a NAS. But in
>     this case it's a NFS share to the local RAID10 storage. There
>     aren't any other settings that deal with this to not drain a node?
>
>     On Mon, Nov 30, 2020 at 1:02 PM Paul Edmon <pedmon at cfa.harvard.edu
>     <mailto:pedmon at cfa.harvard.edu>> wrote:
>
>         That can help.  Usually this happens due to laggy storage the job
>         is using taking time flushing the job's data.  So making sure that
>         your storage is up, responsive, and stable will also cut these down.
>
>         -Paul Edmon-
>
>         On 11/30/2020 12:52 PM, Robert Kudyba wrote:
>         > I've seen where this was a bug that was fixed
>         > https://bugs.schedmd.com/show_bug.cgi?id=3941
>         > but this happens
>         > occasionally still. A user cancels his/her job and a node gets
>         > drained. UnkillableStepTimeout=120 is set in slurm.conf
>         >
>         > Slurm 20.02.3 on Centos 7.9 running on Bright Cluster 8.2
>         >
>         > Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED, ExitCode 0
>         > Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
>         > update_node: node node001 reason set to: Kill task failed
>         > update_node: node node001 state set to DRAINING
>         > error: slurmd error running JobId=6908 on node(s)=node001: Kill task failed
>         >
>         > update_node: node node001 reason set to: hung
>         > update_node: node node001 state set to DOWN
>         > update_node: node node001 state set to IDLE
>         > error: Nodes node001 not responding
>         >
>         > scontrol show config | grep kill
>         > UnkillableStepProgram   = (null)
>         > UnkillableStepTimeout   = 120 sec
>         >
>         > Do we just increase the timeout value?
>