[slurm-users] job_time_limit: inactivity time limit reached ...
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Wed Sep 21 12:43:07 UTC 2022
Hi Paul,
Interesting observation on the execution time and the pipe! How do you
ensure that you have enough disk space for the uncompressed database dump?
Maybe by using the /dev/shm tmpfs?
The lbzip2 mentioned in the link below is significantly faster than bzip2.
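For example, something along these lines (an untested sketch; the paths are
just illustrations, and slurm_db is the database name from your dump command)
would write the dump to tmpfs and compress from there:

   DUMP=/dev/shm/slurm_db.sql
   mysqldump -R --single-transaction -B slurm_db > "$DUMP"
   lbzip2 -c "$DUMP" > /root/backup/slurm_db.sql.bz2
   rm -f "$DUMP"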
Best regards,
Ole
On 9/21/22 14:38, Paul Raines wrote:
>
> Almost all the 5 min+ time was in the bzip2. The mysqldump by itself was
> about 16 seconds.  So I moved the bzip2 to its own separate line, so
> the tables are only locked for those ~16 seconds.
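> Roughly like this now (paths are just for illustration):
>
>    mysqldump -R --single-transaction -B slurm_db > /backup/slurm_db.sql
>    bzip2 -f /backup/slurm_db.sql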
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
>
>
> On Wed, 21 Sep 2022 3:49am, Ole Holm Nielsen wrote:
>
>> Hi Paul,
>>
>> IMHO, using logrotate is the most convenient method for making daily
>> database backup dumps and keeping a number of backup versions; see the
>> notes in
>> https://wiki.fysik.dtu.dk/niflheim/Slurm_database#backup-script-with-logrotate
>>
>>
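>> The idea is roughly this (a minimal sketch; paths, retention and options
>> are illustrative, and the initial dump file must exist once before
>> logrotate takes over; see the wiki page for the full setup):
>>
>>    /root/mysql_backup/mysql_dump {
>>        daily
>>        rotate 8
>>        missingok
>>        compress
>>        postrotate
>>            /usr/bin/mysqldump --single-transaction --all-databases > /root/mysql_backup/mysql_dump
>>        endscript
>>    }
>>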
>> Using --single-transaction is recommended by SchedMD to avoid race
>> conditions when slurmdbd is being run while taking the MySQL dump, see
>> https://bugs.schedmd.com/show_bug.cgi?id=10295#c18
>>
>> /Ole
>>
>> On 9/20/22 15:17, Paul Raines wrote:
>>>
>>> Further investigation found that I had set up logrotate to handle a mysql
>>> dump
>>>
>>> mysqldump -R --single-transaction -B slurm_db | bzip2
>>>
>>> which is what is taking 5 minutes. I think this is most likely locking
>>> the tables for that whole time, hanging slurmdbd's database calls and
>>> causing the issue.  I will need to rework it.
>>>
>>> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>>
>>>
>>>
>>> On Mon, 19 Sep 2022 9:29am, Reed Dier wrote:
>>>
>>>> I’m not sure if this might be helpful, but my logrotate.d for slurm
>>>> looks a bit different: instead of a systemctl reload, I am sending the
>>>> specific SIGUSR2 signal, which is supposedly there for the specific
>>>> purpose of log rotation in Slurm.
>>>>
>>>>> postrotate
>>>>> pkill -x --signal SIGUSR2 slurmctld
>>>>> pkill -x --signal SIGUSR2 slurmd
>>>>> pkill -x --signal SIGUSR2 slurmdbd
>>>>> exit 0
>>>>> endscript
>>>>
>>>> I would take a look here: https://slurm.schedmd.com/slurm.conf.html#lbAQ
>>>>
>>>> Reed
>>>>
>>>>> On Sep 19, 2022, at 7:46 AM, Paul Raines <raines at nmr.mgh.harvard.edu>
>>>>> wrote:
>>>>>
>>>>>
>>>>> I have had two nights where right at 3:35am a bunch of jobs were
>>>>> killed early with TIMEOUT way before their normal TimeLimit.
>>>>> The slurmctld log has lots of lines like this at 3:35am:
>>>>>
>>>>> [2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached for JobId=1636922
>>>>>
>>>>> with jobs running on several different nodes.
>>>>>
>>>>> The one curious thing is that right about this time, log rotation is
>>>>> happening via cron on the slurmctld master node:
>>>>>
>>>>> Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily) starting logrotate
>>>>> Sep 12 03:34:59 mlsc-head run-parts[1719028]: (/etc/cron.daily) finished logrotate
>>>>>
>>>>> The 5 minute runtime here is a big anomaly.  On other machines, like
>>>>> nodes just running slurmd or my web servers, this only takes a
>>>>> couple of
>>>>> seconds.
>>>>>
>>>>> In /etc/logrotate.d/slurmctl I have
>>>>>
>>>>> postrotate
>>>>> systemctl reload slurmdbd >/dev/null 2>/dev/null || true
>>>>> /bin/sleep 1
>>>>> systemctl reload slurmctld >/dev/null 2>/dev/null || true
>>>>> endscript
>>>>>
>>>>> Does it make sense that this could be causing the issue?
>>>>>
>>>>> In slurm.conf I had InactiveLimit=60, which I guess is what is
>>>>> happening, but my reading of the docs on this setting was that it only
>>>>> affects the starting of a job with srun/salloc, not a job that has been
>>>>> running for days.  Is it InactiveLimit that leads to the "inactivity
>>>>> time limit reached" message?
>>>>>
>>>>> Anyway, I have changed InactiveLimit=600 to see if that helps.
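>>>>> That is, in slurm.conf:
>>>>>
>>>>>    InactiveLimit=600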