<div dir="ltr">This may be more "cargo cult" but I've advised users to add a "sleep 60" to the end of their job scripts if they are "I/O intensive".  Sometimes they are somehow able to generate I/O in a way that slurm thinks the job is finished, but the OS is still catching up on the I/O, and then slurm tries to kill the job...</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Nov 30, 2020 at 10:49 AM Robert Kudyba <<a href="mailto:rkudyba@fordham.edu">rkudyba@fordham.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Sure I've seen that in some of the posts here, e.g., a NAS. But in this case it's a NFS share to the local RAID10 storage. There aren't any other settings that deal with this to not drain a node?</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Nov 30, 2020 at 1:02 PM Paul Edmon <<a href="mailto:pedmon@cfa.harvard.edu" target="_blank">pedmon@cfa.harvard.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">That can help.  Usually this happens due to laggy storage the job is <br>
using is slow to flush the job's data.  So making sure that your <br>
storage is up, responsive, and stable will also cut these down.<br>
<br>
-Paul Edmon-<br>
<br>
On 11/30/2020 12:52 PM, Robert Kudyba wrote:<br>
> I've seen where this was a bug that was fixed <br>
> <a href="https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D3941&d=DwIDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=uhj_tXWcDUyyhKZogEh3zXEjkcPHj3Yzkzh7dZnMLiI&s=Chhfs3vBdTd3SG3KKgQmrBf3W_B6tjn5lP4qS-YRrh8&e=" rel="noreferrer" target="_blank">https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D3941&d=DwIDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=uhj_tXWcDUyyhKZogEh3zXEjkcPHj3Yzkzh7dZnMLiI&s=Chhfs3vBdTd3SG3KKgQmrBf3W_B6tjn5lP4qS-YRrh8&e=</a>  <br>
> <<a href="https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D3941&d=DwIDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=uhj_tXWcDUyyhKZogEh3zXEjkcPHj3Yzkzh7dZnMLiI&s=Chhfs3vBdTd3SG3KKgQmrBf3W_B6tjn5lP4qS-YRrh8&e=" rel="noreferrer" target="_blank">https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D3941&d=DwIDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=uhj_tXWcDUyyhKZogEh3zXEjkcPHj3Yzkzh7dZnMLiI&s=Chhfs3vBdTd3SG3KKgQmrBf3W_B6tjn5lP4qS-YRrh8&e=</a> > but this happens <br>
> occasionally still. A user cancels his/her job and a node gets <br>
> drained. UnkillableStepTimeout=120 is set in slurm.conf<br>
><br>
> Slurm 20.02.3 on Centos 7.9 running on Bright Cluster 8.2<br>
><br>
> Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED, <br>
> ExitCode 0<br>
> Resending TERMINATE_JOB request JobId=6908 Nodelist=node001<br>
> update_node: node node001 reason set to: Kill task failed<br>
> update_node: node node001 state set to DRAINING<br>
> error: slurmd error running JobId=6908 on node(s)=node001: Kill task <br>
> failed<br>
><br>
> update_node: node node001 reason set to: hung<br>
> update_node: node node001 state set to DOWN<br>
> update_node: node node001 state set to IDLE<br>
> error: Nodes node001 not responding<br>
><br>
> scontrol show config | grep kill<br>
> UnkillableStepProgram   = (null)<br>
> UnkillableStepTimeout   = 120 sec<br>
><br>
> Do we just increase the timeout value?<br>
<br>
</blockquote></div>
</blockquote></div>
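<div dir="ltr"><br>For what it's worth, here is roughly the job-script pattern I suggest to users. It is only a sketch: the job name, resource requests, and application command are placeholders, not anything from this thread.<br><pre>
#!/bin/bash
#SBATCH --job-name=io_heavy        # placeholder job name
#SBATCH --ntasks=1                 # placeholder resource request
#SBATCH --time=08:00:00            # placeholder walltime

# The actual I/O-intensive work (placeholder command).
srun ./my_io_heavy_app

# Give the filesystem time to finish flushing writes before the script
# exits, so slurmd is less likely to find processes still stuck in I/O
# wait when it cleans up the job.
sync        # optional: ask the kernel to start flushing dirty pages
sleep 60
</pre></div>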
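<div dir="ltr"><br>On the UnkillableStepTimeout question quoted above: raising it can help when the processes do eventually die once the I/O completes. The snippet below is only a sketch of what that could look like in slurm.conf; 300 is an arbitrary example value, and the UnkillableStepProgram path is a hypothetical site script, not something from this thread.<br><pre>
# slurm.conf (example values, not recommendations)

# How long slurmd waits for a job step's processes to die after SIGKILL
# before it reports "Kill task failed" and the node gets drained.
# The thread currently uses 120; 300 is just an illustrative larger value.
UnkillableStepTimeout=300

# Optional: a site-provided script to run when an unkillable step is
# detected, e.g. to dump process state for debugging (hypothetical path).
UnkillableStepProgram=/usr/local/sbin/report_unkillable.sh
</pre><br>Remember to push the updated slurm.conf out to the nodes and reconfigure/restart the Slurm daemons so the new value takes effect.</div>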