[slurm-users] nodes lingering in completion

Henderson, Brent brent.henderson at hpe.com
Fri Apr 1 16:22:09 UTC 2022

Hi slurm experts -

I've gotten temporary access to a cluster with 1k nodes, so of course I set up Slurm on it (v20.11.8).  :)  Small jobs are fine, and their nodes go back to idle rather quickly.  After a job that uses all the nodes, some nodes 'linger' in the completing state for over a minute; others take less time, but the delay is still noticeable.

Reading some older posts, I see that the epilog is a typical cause of this, so I removed it from the config file; indeed, nodes then go back to the idle state very quickly after the job completes.  I then created an epilog on each node in /tmp that contained just the bash header and 'exit 0', and reduced my run to just: 'salloc -N 1024 sleep 10'.  Even with this very simple command and epilog, the nodes exhibit the 'lingering' behavior before returning to idle.
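
For reference, the stub epilog described above amounts to roughly the following sketch. The file name epilog_noop.sh is a placeholder (only /tmp is given above), and the last lines just run the script by hand to confirm that it does nothing and returns 0.

```shell
# Placeholder path: the post only says the epilog lived in /tmp on each node.
epilog_path="${TMPDIR:-/tmp}/epilog_noop.sh"

# The epilog itself: a bash header and an immediate successful exit.
cat > "$epilog_path" <<'EOF'
#!/bin/bash
exit 0
EOF
chmod +x "$epilog_path"

# Run it by hand, outside Slurm, and capture the exit code.
"$epilog_path"
epilog_rc=$?
echo "epilog exit code: $epilog_rc"

# slurm.conf would point at the script with Epilog=<path>, and the test run was:
#   salloc -N 1024 sleep 10
```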

Looking in the slurmd log for one of the nodes that took >60s to go back to idle, I see this:

[2022-03-31T20:57:44.158] Warning: Note very large processing time from prep_epilog: usec=75087286 began=20:56:29.070
[2022-03-31T20:57:44.158] epilog for job 43226 ran for 75 seconds
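
To compare nodes, the per-job epilog times can be pulled out of the slurmd logs with something like the sketch below. It runs against the two lines quoted above rather than a real log file, and it assumes the message format stays exactly as shown. Note that the usec value (about 75.09 s) added to the 'began' time of 20:56:29.070 lands on the log timestamp of 20:57:44.158, so the two messages agree with each other.

```shell
# Sample lines copied from the slurmd log above; on a node this would be the log file.
log_sample='[2022-03-31T20:57:44.158] Warning: Note very large processing time from prep_epilog: usec=75087286 began=20:56:29.070
[2022-03-31T20:57:44.158] epilog for job 43226 ran for 75 seconds'

# Extract "<jobid> <seconds>" from the "epilog for job N ran for S seconds" lines.
epilog_times=$(printf '%s\n' "$log_sample" |
  awk '/epilog for job [0-9]+ ran for [0-9]+ seconds/ {
         for (i = 1; i <= NF; i++)
           if ($i == "job") print $(i + 1), $(i + 4)
       }')
echo "$epilog_times"    # prints: 43226 75

# Extract the microsecond figure from the prep_epilog warning and convert to whole seconds.
usec=$(printf '%s\n' "$log_sample" | sed -n 's/.*usec=\([0-9][0-9]*\).*/\1/p')
echo "prep_epilog took $(( usec / 1000000 )) s"
```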

I tried upping the debug level on the slurmd side but didn't see anything useful.

So, I guess I have a couple questions:
- Has anyone seen this behavior before and know of a fix?  :)
- Might this issue be resolved in 21.08?  (I didn't see anything in the release notes about the epilog.)
- Any thoughts on how to collect additional information about what might be slowing down the epilog on the system?
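
On the last question, one thing that could be tried is an epilog that timestamps its own start and end, so its run time can be compared with the 75 seconds slurmd reports. If the script's own interval is tiny, the delay would be outside the script itself. This is only a sketch: the paths and the EPILOG_TRACE_LOG variable are made up for illustration, and the final lines are a dry run outside Slurm with the job ID set by hand.

```shell
# Hypothetical paths, chosen for illustration only.
epilog_path="${TMPDIR:-/tmp}/epilog_timed.sh"
trace_log="${TMPDIR:-/tmp}/epilog_trace.log"

cat > "$epilog_path" <<'EOF'
#!/bin/bash
# Record wall-clock time at entry and exit so the script's own run time
# can be compared with the epilog duration that slurmd logs.
# EPILOG_TRACE_LOG is a made-up override, not a Slurm variable.
trace_log="${EPILOG_TRACE_LOG:-/tmp/epilog_trace.log}"
echo "$(date +%s.%N) job=${SLURM_JOB_ID:-unknown} node=$(uname -n) start" >> "$trace_log"
# ... any real epilog work would go here ...
echo "$(date +%s.%N) job=${SLURM_JOB_ID:-unknown} node=$(uname -n) end" >> "$trace_log"
exit 0
EOF
chmod +x "$epilog_path"

# Dry run outside Slurm: set the job ID by hand and check that two lines are written.
: > "$trace_log"
SLURM_JOB_ID=43226 EPILOG_TRACE_LOG="$trace_log" "$epilog_path"
trace_count=$(grep -c 'job=43226' "$trace_log")
echo "trace lines written: $trace_count"
```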



-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: slurm_conf.txt
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20220401/b1cb1ccd/attachment.txt>
