[slurm-users] job_time_limit: inactivity time limit reached ...
Paul Raines
raines at nmr.mgh.harvard.edu
Wed Sep 21 12:38:22 UTC 2022
Almost all of the 5+ minutes was spent in bzip2. The mysqldump by itself
took about 16 seconds. So I moved the bzip2 to its own separate line, so
the tables are only locked for those ~16 seconds.
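
A minimal sketch of that change, assuming a plain sh script called from
logrotate. The function name, the DUMP_CMD and COMPRESS_CMD variables, and
the output path are hypothetical names for illustration; only the mysqldump
options come from my original command. When mysqldump is piped into bzip2,
it has to wait whenever the compressor falls behind, so the dump stays open
for as long as the compression takes. Writing the dump to a file first
avoids that.

```shell
#!/bin/sh
# Sketch only: dump first, compress afterwards, so the database is busy
# just for the duration of the dump itself.
# DUMP_CMD and COMPRESS_CMD are hypothetical override hooks for this
# example; they are not Slurm, MySQL, or logrotate settings.
: "${DUMP_CMD:=mysqldump -R --single-transaction -B slurm_db}"
: "${COMPRESS_CMD:=bzip2 -f}"

dump_then_compress() {
    out="$1"    # target path for the uncompressed dump (assumed writable)

    # Step 1: write the uncompressed dump to a temporary file.
    # DUMP_CMD is left unquoted on purpose so it splits into words.
    $DUMP_CMD > "$out.tmp" || { rm -f "$out.tmp"; return 1; }
    mv "$out.tmp" "$out"

    # Step 2: compress only after the dump command has exited.
    $COMPRESS_CMD "$out"
}
```

It would be called as, for example, dump_then_compress /var/backups/slurm_db.sql
(the path is an assumption), which leaves slurm_db.sql.bz2 behind.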
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Wed, 21 Sep 2022 3:49am, Ole Holm Nielsen wrote:
> Hi Paul,
>
> IMHO, using logrotate is the most convenient method for making daily database
> backup dumps and keep a number of backup versions, see the notes in
> https://wiki.fysik.dtu.dk/niflheim/Slurm_database#backup-script-with-logrotate
>
> Using --single-transaction is recommended by SchedMD to avoid race conditions
> when slurmdbd is being run while taking the MySQL dump, see
> https://bugs.schedmd.com/show_bug.cgi?id=10295#c18
>
> /Ole
>
> On 9/20/22 15:17, Paul Raines wrote:
>>
>> Further investigation found that I had set up logrotate to handle a mysql
>> dump
>>
>> mysqldump -R --single-transaction -B slurm_db | bzip2
>>
>> which is what is taking 5 minutes. I think this is most likely locking
>> tables for that whole time, hanging calls to slurmdbd and causing the
>> issue. I will need to rework it.
>>
>> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>
>>
>>
>> On Mon, 19 Sep 2022 9:29am, Reed Dier wrote:
>>
>>> I’m not sure if this might be helpful, but my logrotate.d for slurm looks
>>> a bit different: instead of a systemctl reload, I am sending a
>>> SIGUSR2 signal, which is supposedly intended specifically for
>>> log rotation in Slurm.
>>>
>>>> postrotate
>>>> pkill -x --signal SIGUSR2 slurmctld
>>>> pkill -x --signal SIGUSR2 slurmd
>>>> pkill -x --signal SIGUSR2 slurmdbd
>>>> exit 0
>>>> endscript
>>>
>>> I would take a look here: https://slurm.schedmd.com/slurm.conf.html#lbAQ
>>>
>>> Reed
>>>
>>>> On Sep 19, 2022, at 7:46 AM, Paul Raines <raines at nmr.mgh.harvard.edu>
>>>> wrote:
>>>>
>>>>
>>>> I have had two nights where right at 3:35am a bunch of jobs were
>>>> killed early with TIMEOUT way before their normal TimeLimit.
>>>> The slurmctld log has lots of lines like this at 3:35am:
>>>>
>>>> [2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached
>>>> for JobId=1636922
>>>>
>>>> with jobs running on several different nodes.
>>>>
>>>> The one curious thing is right about this time log rotation is happening
>>>> in cron on the slurmctld master node
>>>>
>>>> Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily) starting
>>>> logrotate
>>>> Sep 12 03:34:59 mlsc-head run-parts[1719028]: (/etc/cron.daily) finished
>>>> logrotate
>>>>
>>>> The 5 minute runtime here is a big anomaly. On other machines, like
>>>> nodes just running slurmd or my web servers, this only takes a couple of
>>>> seconds.
>>>>
>>>> In /etc/logrotate.d/slurmctl I have
>>>>
>>>> postrotate
>>>> systemctl reload slurmdbd >/dev/null 2>/dev/null || true
>>>> /bin/sleep 1
>>>> systemctl reload slurmctld >/dev/null 2>/dev/null || true
>>>> endscript
>>>>
>>>> Does it make sense that this could be causing the issue?
>>>>
>>>> In slurm.conf I had InactiveLimit=60, which I guess is what is being hit,
>>>> but my reading of the docs on this setting was that it only affects the
>>>> starting of a job with srun/salloc and not a job that has been running
>>>> for days. Is it InactiveLimit that leads to the "inactivity time limit
>>>> reached" message?
>>>>
>>>> Anyway, I have changed it to InactiveLimit=600 to see if that helps.
>
>
>
>