[slurm-users] nodes lingering in completion

William Brown william at signalbox.org.uk
Fri Apr 1 17:33:01 UTC 2022


To run the epilog a Bash process must be created, so perhaps look at
.bashrc.

Try timing a run of the epilog yourself on a compute node.  I presume it is
owned by an account local to the compute nodes, not by a directory-service
account?
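A minimal sketch of that test, run directly on a compute node. The /tmp/epilog.sh path matches the quoted message below; the `env -i` invocation only roughly approximates the stripped-down environment slurmd gives the epilog, and SLURM_JOB_ID is a placeholder taken from the log excerpt:

```shell
# Recreate the trivial epilog described in the message: shebang plus exit 0.
cat > /tmp/epilog.sh <<'EOF'
#!/bin/bash
exit 0
EOF
chmod +x /tmp/epilog.sh

# Time it with a near-empty environment, roughly as slurmd would invoke it.
# SLURM_JOB_ID is a placeholder; slurmd sets many more SLURM_* variables.
time env -i SLURM_JOB_ID=43226 /bin/bash /tmp/epilog.sh
```

If this runs in well under a second by hand but takes 60+ seconds under slurmd, the script body itself is unlikely to be the bottleneck.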

William

On Fri, 1 Apr 2022, 17:25 Henderson, Brent, <brent.henderson at hpe.com> wrote:

> Hi slurm experts -
>
>
>
> I’ve gotten temporary access to a cluster with 1k nodes - so of course I
> set up Slurm on it (v20.11.8). :)  Small jobs are fine and go back to idle
> rather quickly.  Jobs that use all the nodes will have some nodes ‘linger’
> in the completing state for over a minute, while others take less time -
> but it is still noticeable.
>
>
>
> Reading some older posts, I see that the epilog is a typical cause for
> this, so I removed it from the config file and indeed, nodes very quickly
> go back to the idle state after the job completes.  I then created an
> epilog on each node in /tmp that contained just the bash shebang line and
> ‘exit 0’, and changed my run to be just: ‘salloc -N 1024 sleep 10’.  Even
> with this very simple command and epilog, the nodes exhibit the
> ‘lingering’ behavior before returning to idle.
>
>
>
> Looking in the slurmd log for one of the nodes that took >60s to go back
> to idle, I see this:
>
>
>
> [2022-03-31T20:57:44.158] Warning: Note very large processing time from
> prep_epilog: usec=75087286 began=20:56:29.070
>
> [2022-03-31T20:57:44.158] epilog for job 43226 ran for 75 seconds
>
>
>
> I tried upping the debug level on the slurmd side but didn’t see anything
> useful.
>
>
>
> So, I guess I have a couple questions:
>
> - anyone seen this behavior before and know a fix?  :)
>
> - might this issue be resolved in 21.08?  (Didn’t see anything in the
> release notes that mentioned the epilog.)
>
> - thoughts on how to collect some additional information on what might be
> happening on the system to slow down the epilog?
>
>
>
> Thanks,
>
>
>
> Brent
>
>
>
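One way to collect the extra information asked about in the last question is to wrap the real epilog in a small timing shim and point Epilog= at the shim. A sketch only: the function name `timed_epilog` and the `REAL_EPILOG`/`EPILOG_LOG` paths are all hypothetical, and the shim adds one line per job to a node-local log:

```shell
# Hypothetical wrapper: have Epilog= call a script that runs this function.
# REAL_EPILOG and EPILOG_LOG are placeholder paths - adjust to the site.
timed_epilog() {
    local real="${REAL_EPILOG:-/etc/slurm/epilog.real.sh}"
    local log="${EPILOG_LOG:-/var/log/slurm/epilog-timing.log}"
    local start end rc
    start=$(date +%s)
    "$real" "$@"; rc=$?
    end=$(date +%s)
    # Record per-job wall-clock time and exit status on the node itself.
    echo "$(date -Is) job=${SLURM_JOB_ID:-?} rc=$rc elapsed=$((end - start))s" >> "$log"
    return $rc
}
```

If the log shows the script body finishing quickly while slurmd still reports 60+ second epilog times, the delay is more likely in slurmd's handling of the epilog (for example, slow account lookups against a directory service, as suggested above) than in the script itself.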