[slurm-users] job_time_limit: inactivity time limit reached ...
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Wed Sep 21 12:43:07 UTC 2022
Hi Paul,
Interesting observation on the execution time and the pipe! How do you
ensure that you have enough disk space for the uncompressed database dump?
Maybe by using the /dev/shm tmpfs?
The lbzip2 mentioned in the link below is significantly faster than bzip2.
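For example, something along these lines (an untested sketch; the paths are
just illustrations, and slurm_db is the database name from your dump command)
would write the dump to tmpfs and compress from there:

   DUMP=/dev/shm/slurm_db.sql
   mysqldump -R --single-transaction -B slurm_db > "$DUMP"
   lbzip2 -c "$DUMP" > /root/backup/slurm_db.sql.bz2
   rm -f "$DUMP"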
Best regards,
Ole
On 9/21/22 14:38, Paul Raines wrote:
>
> Almost all the 5 min+ time was in the bzip2. The mysqldump by itself was
> about 16 seconds.  So I moved the bzip2 to its own separate line, so
> the tables are only locked for those ~16 seconds.
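> Roughly like this now (paths are just for illustration):
>
>    mysqldump -R --single-transaction -B slurm_db > /backup/slurm_db.sql
>    bzip2 -f /backup/slurm_db.sql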
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
>
>
> On Wed, 21 Sep 2022 3:49am, Ole Holm Nielsen wrote:
>
>> Hi Paul,
>>
>> IMHO, using logrotate is the most convenient method for making daily
>> database backup dumps and keeping a number of backup versions; see the
>> notes in
>> https://wiki.fysik.dtu.dk/niflheim/Slurm_database#backup-script-with-logrotate
>>
>>
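>> The idea is roughly this (a minimal sketch; paths, retention and options
>> are illustrative, and the initial dump file must exist once before
>> logrotate takes over; see the wiki page for the full setup):
>>
>>    /root/mysql_backup/mysql_dump {
>>        daily
>>        rotate 8
>>        missingok
>>        compress
>>        postrotate
>>            /usr/bin/mysqldump --single-transaction --all-databases > /root/mysql_backup/mysql_dump
>>        endscript
>>    }
>>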
>> Using --single-transaction is recommended by SchedMD to avoid race
>> conditions when slurmdbd is being run while taking the MySQL dump, see
>> https://bugs.schedmd.com/show_bug.cgi?id=10295#c18
>>
>> /Ole
>>
>> On 9/20/22 15:17, Paul Raines wrote:
>>>
>>> Further investigation found that I had set up logrotate to handle a mysql
>>> dump
>>>
>>> mysqldump -R --single-transaction -B slurm_db | bzip2
>>>
>>> which is what is taking 5 minutes. I think this is most likely locking
>>> the tables for that whole time, hanging slurmdbd's database calls and
>>> causing the issue.  I will need to rework it.
>>>
>>> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>>
>>>
>>>
>>> On Mon, 19 Sep 2022 9:29am, Reed Dier wrote:
>>>
>>>> I’m not sure if this might be helpful, but my logrotate.d for slurm
>>>> looks a bit different: instead of a systemctl reload, I am sending the
>>>> specific SIGUSR2 signal, which is supposedly there for the specific
>>>> purpose of log rotation in Slurm.
>>>>
>>>>> postrotate
>>>>> pkill -x --signal SIGUSR2 slurmctld
>>>>> pkill -x --signal SIGUSR2 slurmd
>>>>> pkill -x --signal SIGUSR2 slurmdbd
>>>>> exit 0
>>>>> endscript
>>>>
>>>> I would take a look here: https://slurm.schedmd.com/slurm.conf.html#lbAQ
>>>>
>>>> Reed
>>>>
>>>>> On Sep 19, 2022, at 7:46 AM, Paul Raines <raines at nmr.mgh.harvard.edu>
>>>>> wrote:
>>>>>
>>>>>
>>>>> I have had two nights where right at 3:35am a bunch of jobs were
>>>>> killed early with TIMEOUT way before their normal TimeLimit.
>>>>> The slurmctld log has lots of lines like this at 3:35am:
>>>>>
>>>>> [2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached for JobId=1636922
>>>>>
>>>>> with jobs running on several different nodes.
>>>>>
>>>>> The one curious thing is that right about this time, log rotation is
>>>>> happening via cron on the slurmctld master node:
>>>>>
>>>>> Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily) starting logrotate
>>>>> Sep 12 03:34:59 mlsc-head run-parts[1719028]: (/etc/cron.daily) finished logrotate
>>>>>
>>>>> The 5 minute runtime here is a big anomaly.  On other machines, like
>>>>> nodes just running slurmd or my web servers, this only takes a
>>>>> couple of
>>>>> seconds.
>>>>>
>>>>> In /etc/logrotate.d/slurmctl I have
>>>>>
>>>>> postrotate
>>>>> systemctl reload slurmdbd >/dev/null 2>/dev/null || true
>>>>> /bin/sleep 1
>>>>> systemctl reload slurmctld >/dev/null 2>/dev/null || true
>>>>> endscript
>>>>>
>>>>> Does it make sense that this could be causing the issue?
>>>>>
>>>>> In slurm.conf I had InactiveLimit=60, which I guess is what is
>>>>> happening, but my reading of the docs on this setting was that it only
>>>>> affects the starting of a job with srun/salloc, not a job that has been
>>>>> running for days.  Is it InactiveLimit that leads to the "inactivity
>>>>> time limit reached" message?
>>>>>
>>>>> Anyway, I have changed InactiveLimit=600 to see if that helps.
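>>>>> That is, in slurm.conf:
>>>>>
>>>>>    InactiveLimit=600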