[slurm-users] job_time_limit: inactivity time limit reached ...

Reed Dier reed.dier at focusvq.com
Mon Sep 19 13:29:38 UTC 2022


I’m not sure if this will be helpful, but my logrotate.d file for slurm looks a bit different: instead of a systemctl reload, I send SIGUSR2 to each daemon, which is supposedly the signal intended specifically for log rotation in Slurm.

>     postrotate
>             pkill -x --signal SIGUSR2 slurmctld
>             pkill -x --signal SIGUSR2 slurmd
>             pkill -x --signal SIGUSR2 slurmdbd
>             exit 0
>     endscript

I would take a look here: https://slurm.schedmd.com/slurm.conf.html#lbAQ
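
For context, a complete stanza built around that postrotate block might look roughly like the following. This is only a sketch: the log path /var/log/slurm/*.log, the weekly schedule, the rotate count, and the "slurm" owner in the create line are all assumptions, so adjust them to match your site.

```
# Sketch of /etc/logrotate.d/slurm.
# The path, schedule, and ownership below are assumptions, not site facts.
/var/log/slurm/*.log {
    weekly
    rotate 5
    compress
    missingok
    notifempty
    # Let the daemons reopen their own logs instead of truncating in place.
    nocopytruncate
    # Run postrotate once for all matched logs, not once per file.
    sharedscripts
    create 640 slurm root
    postrotate
        # SIGUSR2 asks each daemon to reopen its log file.
        # Daemons that are not running on this host simply match nothing.
        pkill -x --signal SIGUSR2 slurmctld
        pkill -x --signal SIGUSR2 slurmd
        pkill -x --signal SIGUSR2 slurmdbd
        # pkill exits non-zero when nothing matched, so force success.
        exit 0
    endscript
}
```

Running "logrotate -d" against the file does a dry run that prints what would be rotated without touching the logs or running the postrotate script.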

Reed

> On Sep 19, 2022, at 7:46 AM, Paul Raines <raines at nmr.mgh.harvard.edu> wrote:
> 
> 
> I have had two nights where right at 3:35am a bunch of jobs were
> killed early with TIMEOUT way before their normal TimeLimit.
> The slurmctld log has lots of lines like this at 3:35am:
> 
> [2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached for JobId=1636922
> 
> with jobs running on several different nodes.
> 
> The one curious thing is right about this time log rotation is happening
> in cron on the slurmctld master node
> 
> Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily) starting logrotate
> Sep 12 03:34:59 mlsc-head run-parts[1719028]: (/etc/cron.daily) finished logrotate
> 
> The 5 minute runtime here is a big anomaly.  On other machines, like
> nodes just running slurmd or my web servers, this only takes a couple of seconds.
> 
> In /etc/logrotate.d/slurmctl I have
> 
>   postrotate
>     systemctl reload slurmdbd >/dev/null 2>/dev/null || true
>     /bin/sleep 1
>     systemctl reload slurmctld >/dev/null 2>/dev/null || true
>   endscript
> 
> Does it make sense that this could be causing the issue?
> 
> In slurm.conf I had InactiveLimit=60 which I guess is what is happening
> but my reading of the docs on this setting was it only affects the
> starting of a job with srun/salloc and not a job that has been running
> for days.  Is it InactiveLimit that leads to the "inactivity time limit reached" message?
> 
> Anyway, I have changed InactiveLimit=600 to see if that helps.
> 
> 
> ---------------------------------------------------------------
> Paul Raines                     http://help.nmr.mgh.harvard.edu
> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
> 149 (2301) 13th Street     Charlestown, MA 02129	    USA
