<div dir="auto">To process the epilog, a Bash process must be created, so perhaps look at .bashrc.<div dir="auto"><br></div><div dir="auto">Try timing the epilog yourself on a compute node.  I presume it is owned by an account local to the compute nodes, not a directory service account?<br><div dir="auto"><br></div><div dir="auto">William </div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 1 Apr 2022, 17:25 Henderson, Brent, &lt;<a href="mailto:brent.henderson@hpe.com">brent.henderson@hpe.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div lang="EN-US" link="#0563C1" vlink="#954F72">

<div class="m_-1690225794852522822WordSection1">

<p class="MsoNormal">Hi Slurm experts -</p>

<p class="MsoNormal"><u></u> <u></u></p>

<p class="MsoNormal">I’ve gotten temporary access to a cluster with 1k nodes - so of course I set up Slurm on it (v20.11.8). :)  Small jobs are fine and the nodes go back to idle rather quickly.  Jobs that use all the nodes will have some nodes ‘linger’ in the completing state for over a minute, while others take less time but the delay is still noticeable.</p>

<p class="MsoNormal"><u></u> <u></u></p>

<p class="MsoNormal">Reading some older posts, I see that the epilog is a typical cause for this, so I removed it from the config file and indeed, nodes very quickly go back to the idle state after the job completes.  I then created an epilog in /tmp on each node that contained just the bash header and exit 0, and changed my run to be just: ‘salloc -N 1024  sleep 10’.  Even with this very simple command and epilog, the nodes exhibit the ‘lingering’ behavior before returning to idle.</p>

<p class="MsoNormal"><u></u> <u></u></p>

<p class="MsoNormal">Looking in the slurmd log for one of the nodes that took >60s to go back to idle, I see this:<u></u><u></u></p>

<p class="MsoNormal"><u></u> <u></u></p>

<p class="MsoNormal">[2022-03-31T20:57:44.158] Warning: Note very large processing time from prep_epilog: usec=75087286 began=20:56:29.070<u></u><u></u></p>

<p class="MsoNormal">[2022-03-31T20:57:44.158] epilog for job 43226 ran for 75 seconds<u></u><u></u></p>

<p class="MsoNormal"><u></u> <u></u></p>

<p class="MsoNormal">I tried raising the debug level on the slurmd side but didn’t see anything useful.</p>

<p class="MsoNormal"><u></u> <u></u></p>

<p class="MsoNormal">So, I guess I have a couple of questions:</p>

<p class="MsoNormal">- has anyone seen this behavior before and know of a fix?  :)</p>

<p class="MsoNormal">- might this issue be resolved in 21.08?  (I didn’t see anything in the release notes that mentioned the epilog.)</p>

<p class="MsoNormal">- thoughts on how to collect some additional information on what might be happening on the system to slow down the epilog?<u></u><u></u></p>

<p class="MsoNormal"><u></u> <u></u></p>

<p class="MsoNormal">Thanks,<u></u><u></u></p>

<p class="MsoNormal"><u></u> <u></u></p>

<p class="MsoNormal">Brent<u></u><u></u></p>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

</div>


</blockquote></div>
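<div dir="auto">For timing the epilog by hand on a compute node, a minimal sketch along these lines might do. The path /tmp/epilog.sh is only a placeholder (use whatever Epilog= points to in your slurm.conf), and the no-op script is created only if nothing exists at that path. It assumes GNU date for the nanosecond timestamp. Running it by hand will not reproduce the environment or user that slurmd uses, so treat the number as a lower bound rather than a reproduction.</div>

```shell
#!/bin/bash
# Hypothetical path: substitute the Epilog= value from your slurm.conf.
EPILOG="${EPILOG:-/tmp/epilog.sh}"

# Create a no-op epilog (bash header plus exit 0) if none exists yet.
if [ ! -e "$EPILOG" ]; then
    printf '#!/bin/bash\nexit 0\n' > "$EPILOG"
    chmod +x "$EPILOG"
fi

# Wall-clock timestamps in nanoseconds since the epoch (GNU date).
start=$(date +%s%N)
"$EPILOG"            # executed directly, so the shebang spawns a new bash
status=$?
end=$(date +%s%N)

elapsed_ms=$(( (end - start) / 1000000 ))
echo "epilog exit status: $status, elapsed: ${elapsed_ms} ms"
```

<div dir="auto">If this returns in a few milliseconds while slurmd still logs tens of seconds for the same script, the delay is probably not in the script body itself.</div>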