[slurm-users] job_time_limit: inactivity time limit reached ...
Brian Andrus
toomuchit at gmail.com
Mon Sep 19 13:56:01 UTC 2022
Paul,
You are likely spot on with the InactiveLimit change. It may also be
that the TMOUT environment variable (under bash) is set.
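
A quick way to check both is sketched below. This is only a rough sketch:
the function names check_tmout and check_inactive_limit are made up for
illustration, and it assumes scontrol is on the PATH of the machine where
you run it. Paste it into the shell you want to inspect (rather than
running it as a separate script), since TMOUT is often set but not
exported.

```shell
# Hypothetical helper: report bash's idle-timeout variable for this shell.
check_tmout() {
    if [ -n "${TMOUT:-}" ]; then
        echo "TMOUT=${TMOUT}"
    else
        echo "TMOUT is unset"
    fi
}

# Hypothetical helper: show the InactiveLimit the controller is using.
# Assumes scontrol is installed; prints a notice if it is not.
check_inactive_limit() {
    if command -v scontrol >/dev/null 2>&1; then
        scontrol show config | grep -i 'InactiveLimit' \
            || echo "InactiveLimit not reported"
    else
        echo "scontrol not found"
    fi
}

check_tmout
check_inactive_limit
```

If TMOUT shows up, look in /etc/profile, /etc/profile.d/ and the users'
own startup files to see where it is being set.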
Brian Andrus
On 9/19/2022 5:46 AM, Paul Raines wrote:
>
> I have had two nights where right at 3:35am a bunch of jobs were
> killed early with TIMEOUT way before their normal TimeLimit.
> The slurmctld log has lots of lines like this at 3:35am:
>
> [2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached for JobId=1636922
>
> with jobs running on several different nodes.
>
> The one curious thing is that right about this time, log rotation is
> happening in cron on the slurmctld master node:
>
> Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily) starting logrotate
> Sep 12 03:34:59 mlsc-head run-parts[1719028]: (/etc/cron.daily) finished logrotate
>
> The 5 minute runtime here is a big anomaly. On other machines, like
> nodes just running slurmd or my web servers, this only takes a couple
> of seconds.
>
> In /etc/logrotate.d/slurmctl I have
>
>     postrotate
>         systemctl reload slurmdbd >/dev/null 2>/dev/null || true
>         /bin/sleep 1
>         systemctl reload slurmctld >/dev/null 2>/dev/null || true
>     endscript
>
> Does it make sense that this could be causing the issue?
>
> In slurm.conf I had InactiveLimit=60, which I guess is what is being
> triggered, but my reading of the docs was that this setting only
> affects the starting of a job with srun/salloc, not a job that has
> been running for days. Is it InactiveLimit that leads to the
> "inactivity time limit reached" message?
>
> Anyway, I have changed it to InactiveLimit=600 to see if that helps.
>
>
> ---------------------------------------------------------------
> Paul Raines http://help.nmr.mgh.harvard.edu
> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
> 149 (2301) 13th Street Charlestown, MA 02129 USA