[slurm-users] job_time_limit: inactivity time limit reached ...
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Wed Sep 21 07:49:33 UTC 2022
Hi Paul,
IMHO, using logrotate is the most convenient method for making daily
database backup dumps and keeping a number of backup versions; see the notes in
https://wiki.fysik.dtu.dk/niflheim/Slurm_database#backup-script-with-logrotate
Using --single-transaction is recommended by SchedMD to avoid race
conditions when slurmdbd is running while the MySQL dump is taken, see
https://bugs.schedmd.com/show_bug.cgi?id=10295#c18
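
For illustration, a minimal sketch of a dump wrapper that logrotate (or
cron) could call. The backup directory, the file naming and the function
names are assumptions made up for this example and are not taken from the
wiki page; the mysqldump options are the ones from Paul's command, written
in long form (-R is --routines, -B is --databases).

```shell
#!/bin/bash
# Sketch only: BACKUP_DIR, the file naming and both function names are
# hypothetical. DB_NAME defaults to the database name used in this thread.
DB_NAME="${DB_NAME:-slurm_db}"
BACKUP_DIR="${BACKUP_DIR:-/var/backups/slurm}"

# Print the dump file path for a given date stamp (YYYYMMDD).
backup_path() {
    printf '%s/%s-%s.sql.bz2\n' "$BACKUP_DIR" "$DB_NAME" "$1"
}

# Dump the database with --single-transaction and compress the result.
run_backup() {
    local out
    out="$(backup_path "$(date +%Y%m%d)")"
    mkdir -p "$BACKUP_DIR" || return 1
    # pipefail in a subshell: a mysqldump failure is not hidden by bzip2.
    ( set -o pipefail
      mysqldump --single-transaction --routines --databases "$DB_NAME" \
          | bzip2 > "$out" )
}

# Dump only when called with --run, so sourcing the file has no side effects.
if [ "${1:-}" = "--run" ]; then
    run_backup
fi
```

Running it by hand under "time" (for example "time ./slurm_db_dump.sh --run",
where the script name is again hypothetical) is one way to see how long the
dump itself takes, separately from the rest of the logrotate run.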
/Ole
On 9/20/22 15:17, Paul Raines wrote:
>
> Further investigation found that I had set up logrotate to handle a mysql
> dump
>
> mysqldump -R --single-transaction -B slurm_db | bzip2
>
> which is what is taking 5 minutes. I think this is most likely locking
> tables during that time, hanging calls to slurmdbd and causing the issue.
> I will need to rework it.
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
>
>
> On Mon, 19 Sep 2022 9:29am, Reed Dier wrote:
>
>> I’m not sure if this might be helpful, but my logrotate.d for slurm
>> looks a bit different: instead of a systemctl reload, I am
>> sending a specific SIGUSR2 signal, which is supposedly for the specific
>> purpose of logrotation in slurm.
>>
>>> postrotate
>>> pkill -x --signal SIGUSR2 slurmctld
>>> pkill -x --signal SIGUSR2 slurmd
>>> pkill -x --signal SIGUSR2 slurmdbd
>>> exit 0
>>> endscript
>>
>> I would take a look here: https://slurm.schedmd.com/slurm.conf.html#lbAQ
>>
>> Reed
>>
>>> On Sep 19, 2022, at 7:46 AM, Paul Raines <raines at nmr.mgh.harvard.edu>
>>> wrote:
>>>
>>>
>>> I have had two nights where right at 3:35am a bunch of jobs were
>>> killed early with TIMEOUT way before their normal TimeLimit.
>>> The slurmctld log has lots of lines at 3:35am like
>>>
>>> [2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached
>>> for JobId=1636922
>>>
>>> with jobs running on several different nodes.
>>>
>>> The one curious thing is right about this time log rotation is happening
>>> in cron on the slurmctld master node
>>>
>>> Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily)
>>> starting logrotate
>>> Sep 12 03:34:59 mlsc-head run-parts[1719028]: (/etc/cron.daily)
>>> finished logrotate
>>>
>>> The 5 minute runtime here is a big anomaly. On other machines, like
>>> nodes just running slurmd or my web servers, this only takes a couple
>>> of seconds.
>>>
>>> In /etc/logrotate.d/slurmctl I have
>>>
>>> postrotate
>>> systemctl reload slurmdbd >/dev/null 2>/dev/null || true
>>> /bin/sleep 1
>>> systemctl reload slurmctld >/dev/null 2>/dev/null || true
>>> endscript
>>>
>>> Does it make sense that this could be causing the issue?
>>>
>>> In slurm.conf I had InactiveLimit=60, which I guess is what is being hit,
>>> but my reading of the docs on this setting was that it only affects the
>>> starting of a job with srun/salloc and not a job that has been running
>>> for days. Is it InactiveLimit that leads to the "inactivity time limit
>>> reached" message?
>>>
>>> Anyway, I have changed to InactiveLimit=600 to see if that helps.