[slurm-users] job_time_limit: inactivity time limit reached ...

Wed Sep 21 07:49:33 UTC 2022

Hi Paul,

IMHO, using logrotate is the most convenient method for making daily 
database backup dumps and keeping a number of backup versions; see the notes in
https://wiki.fysik.dtu.dk/niflheim/Slurm_database#backup-script-with-logrotate

Using --single-transaction is recommended by SchedMD to avoid race 
conditions when the MySQL dump is taken while slurmdbd is running; see
https://bugs.schedmd.com/show_bug.cgi?id=10295#c18
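
For illustration, here is a minimal sketch of such a dump command. The
database name slurm_db and the options -R/-B (written out here as --routines
and --databases) are taken from the command quoted below; the helper name
dump_command, the RUN switch and the BACKUP_DIR location are hypothetical and
only serve this example. The script prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: dump_command, RUN and BACKUP_DIR are hypothetical names.
DB="${DB:-slurm_db}"                # database name from the quoted command
BACKUP_DIR="${BACKUP_DIR:-/tmp}"    # assumed location; adjust for your site

dump_command() {
    # --single-transaction: dump from one consistent snapshot (transactional
    #                       tables such as InnoDB) instead of locking tables
    # --routines:           include stored procedures and functions (-R)
    # --databases:          emit CREATE DATABASE / USE statements (-B)
    printf 'mysqldump --single-transaction --routines --databases %s' "$1"
}

if [ "${RUN:-0}" = "1" ]; then
    # Run the dump and compress it, as in the quoted logrotate setup.
    $(dump_command "$DB") | bzip2 > "$BACKUP_DIR/$DB.sql.bz2"
else
    # Default: only show what would be run.
    dump_command "$DB"
    echo
fi
```

Running the script with RUN=1 from a logrotate or cron job would write one
compressed dump; keeping several versions is then left to logrotate itself.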

/Ole

On 9/20/22 15:17, Paul Raines wrote:
> 
> Further investigation found that I had set up logrotate to handle a mysql
> dump
> 
>    mysqldump -R --single-transaction -B slurm_db | bzip2
> 
> which is what is taking 5 minutes.  I think this is most likely locking
> tables for that time, hanging calls to slurmdbd and causing the issue.
> I will need to rework it.
> 
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
> 
> 
> 
> On Mon, 19 Sep 2022 9:29am, Reed Dier wrote:
> 
>> I’m not sure if this might be helpful, but my logrotate.d for slurm 
>> looks a bit different: instead of a systemctl reload, I am 
>> sending a specific SIGUSR2 signal, which is supposedly for the specific 
>> purpose of log rotation in slurm.
>>
>>>     postrotate
>>>             pkill -x --signal SIGUSR2 slurmctld
>>>             pkill -x --signal SIGUSR2 slurmd
>>>             pkill -x --signal SIGUSR2 slurmdbd
>>>             exit 0
>>>     endscript
>>
>> I would take a look here: https://slurm.schedmd.com/slurm.conf.html#lbAQ 
>>
>> Reed
>>
>>> On Sep 19, 2022, at 7:46 AM, Paul Raines <raines at nmr.mgh.harvard.edu> 
>>> wrote:
>>>
>>>
>>> I have had two nights where right at 3:35am a bunch of jobs were
>>> killed early with TIMEOUT, way before their normal TimeLimit.
>>> The slurmctld log has lots of lines like this at 3:35am:
>>>
>>> [2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached 
>>> for JobId=1636922
>>>
>>> with jobs running on several different nodes.
>>>
>>> The one curious thing is that right about this time log rotation is
>>> being run by cron on the slurmctld master node
>>>
>>> Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily) 
>>> starting logrotate
>>> Sep 12 03:34:59 mlsc-head run-parts[1719028]: (/etc/cron.daily) 
>>> finished logrotate
>>>
>>> The 5 minute runtime here is a big anomaly.  On other machines, like
>>> nodes just running slurmd or my web servers, this only takes a couple 
>>> of seconds.
>>>
>>> In /etc/logrotate.d/slurmctl I have
>>>
>>>   postrotate
>>>     systemctl reload slurmdbd >/dev/null 2>/dev/null || true
>>>     /bin/sleep 1
>>>     systemctl reload slurmctld >/dev/null 2>/dev/null || true
>>>   endscript
>>>
>>> Does it make sense that this could be causing the issue?
>>>
>>> In slurm.conf I had InactiveLimit=60, which I guess is what is being hit,
>>> but my reading of the docs on this setting was that it only affects the
>>> starting of a job with srun/salloc and not a job that has been running
>>> for days.  Is it InactiveLimit that leads to the "inactivity time limit 
>>> reached" message?
>>>
>>> Anyway, I have changed it to InactiveLimit=600 to see if that helps.