[slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

Mon Nov 30 18:54:26 UTC 2020

This may be more "cargo cult" than fix, but I've advised users to add a
"sleep 60" to the end of their job scripts if the jobs are I/O intensive.
Sometimes they manage to generate I/O in such a way that Slurm thinks the
job is finished while the OS is still catching up on the I/O, and then
Slurm tries to kill the job...
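
For illustration, a minimal sketch of what such a job script might look
like. The job name, time limit, output path, and the run_workload
placeholder are all made up for the example; only the trailing 60-second
sleep is the actual advice. Skipping the sleep when SLURM_JOB_ID is unset
is my own convenience so the script can be tried outside of Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=io-heavy    # hypothetical job name
#SBATCH --time=01:00:00        # hypothetical time limit

# Hypothetical stand-in for the real I/O-intensive step.
run_workload() {
    local out="$1"
    printf 'results\n' > "$out"
}

# Sleep before the script exits so the OS can catch up on pending I/O.
# The sleep only happens inside a Slurm job (SLURM_JOB_ID is set), so the
# script can be tried interactively without the wait.
grace_period() {
    local seconds="${1:-60}"
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        sleep "$seconds"
    fi
}

main() {
    local out="${OUTFILE:-${TMPDIR:-/tmp}/io-heavy.out}"   # hypothetical path
    run_workload "$out"
    local status=$?
    grace_period 60
    return "$status"    # keep the workload's exit status as the job's
}

main "$@"
```

Returning the workload's status from main keeps the sleep from masking a
failed job, since otherwise the script would exit with the status of the
sleep.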

On Mon, Nov 30, 2020 at 10:49 AM Robert Kudyba <rkudyba at fordham.edu> wrote:

> Sure, I've seen that in some of the posts here, e.g., with a NAS. But in
> this case it's an NFS share to the local RAID10 storage. Are there any
> other settings that deal with this so that the node doesn't get drained?
>
> On Mon, Nov 30, 2020 at 1:02 PM Paul Edmon <pedmon at cfa.harvard.edu> wrote:
>
>> That can help.  Usually this happens because laggy storage that the job
>> is using takes a long time to flush the job's data.  So making sure that
>> your storage is up, responsive, and stable will also cut these down.
>>
>> -Paul Edmon-
>>
>> On 11/30/2020 12:52 PM, Robert Kudyba wrote:
>> > I've seen where this was a bug that was fixed
>> > <https://bugs.schedmd.com/show_bug.cgi?id=3941>
>> > but this happens
>> > occasionally still. A user cancels his/her job and a node gets
>> > drained. UnkillableStepTimeout=120 is set in slurm.conf
>> >
>> > Slurm 20.02.3 on CentOS 7.9 running on Bright Cluster 8.2
>> >
>> > Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED,
>> > ExitCode 0
>> > Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
>> > update_node: node node001 reason set to: Kill task failed
>> > update_node: node node001 state set to DRAINING
>> > error: slurmd error running JobId=6908 on node(s)=node001: Kill task
>> > failed
>> >
>> > update_node: node node001 reason set to: hung
>> > update_node: node node001 state set to DOWN
>> > update_node: node node001 state set to IDLE
>> > error: Nodes node001 not responding
>> >
>> > scontrol show config | grep kill
>> > UnkillableStepProgram   = (null)
>> > UnkillableStepTimeout   = 120 sec
>> >
>> > Do we just increase the timeout value?
>>
>>