<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">I’m not sure whether this will help, but my logrotate.d entry for Slurm looks a bit different. Instead of a systemctl reload, I send the daemons SIGUSR2, which, as I understand it, exists specifically so they reopen their log files after rotation.<div class=""><br class=""></div><div class=""><blockquote type="cite" class=""><font face="Menlo" class=""> postrotate<br class=""> pkill -x --signal SIGUSR2 slurmctld<br class=""> pkill -x --signal SIGUSR2 slurmd<br class=""> pkill -x --signal SIGUSR2 slurmdbd<br class=""> exit 0<br class=""> endscript</font><br class=""></blockquote><div class=""><br class=""></div>I would take a look here: <a href="https://slurm.schedmd.com/slurm.conf.html#lbAQ" class="">https://slurm.schedmd.com/slurm.conf.html#lbAQ</a></div><div class=""><br class=""></div><div class="">Reed<br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Sep 19, 2022, at 7:46 AM, Paul Raines &lt;<a href="mailto:raines@nmr.mgh.harvard.edu" class="">raines@nmr.mgh.harvard.edu</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div class=""><br class="">I have had two nights where, right at 3:35am, a bunch of jobs were<br class="">killed early with TIMEOUT, well before their normal TimeLimit.<br class="">The slurmctld log has lots of lines like this at 3:35am:<br class=""><br class="">[2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached for JobId=1636922<br class=""><br class="">for jobs running on several different nodes.<br class=""><br class="">The one curious thing is that right around this time, log rotation is running<br class="">from cron on the slurmctld master node:<br class=""><br class="">Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily) starting logrotate<br class="">Sep 12 03:34:59 mlsc-head 
run-parts[1719028]: (/etc/cron.daily) finished logrotate<br class=""><br class="">The 5-minute runtime here is a big anomaly. On other machines, like<br class="">nodes just running slurmd or my web servers, this only takes a couple of seconds.<br class=""><br class="">In /etc/logrotate.d/slurmctl I have:<br class=""><br class=""> postrotate<br class=""> systemctl reload slurmdbd >/dev/null 2>/dev/null || true<br class=""> /bin/sleep 1<br class=""> systemctl reload slurmctld >/dev/null 2>/dev/null || true<br class=""> endscript<br class=""><br class="">Could this be causing the issue?<br class=""><br class="">In slurm.conf I had InactiveLimit=60, which I guess is what is being triggered,<br class="">but my reading of the docs was that this setting only affects<br class="">starting a job with srun/salloc, not a job that has been running<br class="">for days. Is it InactiveLimit that produces the "inactivity time limit reached" message?<br class=""><br class="">Anyway, I have changed it to InactiveLimit=600 to see if that helps.<br class=""><br class=""><br class="">---------------------------------------------------------------<br class="">Paul Raines <a href="http://help.nmr.mgh.harvard.edu" class="">http://help.nmr.mgh.harvard.edu</a><br class="">MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging<br class="">149 (2301) 13th Street Charlestown, MA 02129<span class="Apple-tab-span" style="white-space:pre"> </span> USA<br class=""><br class=""></div></div></blockquote></div><br class=""></div></body></html>