[slurm-users] job_time_limit: inactivity time limit reached ...

Paul Raines raines at nmr.mgh.harvard.edu
Wed Sep 21 12:38:22 UTC 2022


Almost all of the 5+ minutes was spent in the bzip2.  The mysqldump by
itself took about 16 seconds.  So I moved the bzip2 to its own separate
line, so the tables are only locked for the ~16 seconds.
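
As a rough sketch of that change (not the exact script in use): when
mysqldump is piped straight into bzip2, the dump can only proceed as fast
as bzip2 consumes it, so writing the dump to a file first and compressing
afterwards keeps the database part short.  The function name and the dump
path below are hypothetical; only the mysqldump options are taken from the
command quoted further down.

```shell
# Hypothetical helper: dump first, compress second.
# $1 = path of the uncompressed dump file (an assumed location)
dump_then_compress() {
    dump="$1"
    # Step 1: write the dump uncompressed, so mysqldump is not slowed
    # down by waiting on the compressor.
    mysqldump -R --single-transaction -B slurm_db > "$dump" || return 1
    # Step 2: compress afterwards; the database is no longer involved.
    bzip2 -f "$dump"
}

# Example call (path is an assumption):
#   dump_then_compress /var/backups/slurm_db.sql
```

The trade-off is that the uncompressed dump needs temporary disk space
until bzip2 has finished.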

-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Wed, 21 Sep 2022 3:49am, Ole Holm Nielsen wrote:

> Hi Paul,
>
> IMHO, using logrotate is the most convenient method for making daily database 
> backup dumps and keep a number of backup versions, see the notes in
> https://wiki.fysik.dtu.dk/niflheim/Slurm_database#backup-script-with-logrotate
>
> Using --single-transaction is recommended by SchedMD to avoid race conditions 
> when slurmdbd is being run while taking the MySQL dump, see
> https://bugs.schedmd.com/show_bug.cgi?id=10295#c18
>
> /Ole
>
> On 9/20/22 15:17, Paul Raines wrote:
>>
>>  Further investigation found that I had set up logrotate to handle a
>>  mysql dump
>>
>>     mysqldump -R --single-transaction -B slurm_db | bzip2
>>
>>  which is what is taking 5 minutes.  I think this is most likely locking
>>  tables for that whole time, hanging calls to slurmdbd and causing the
>>  issue.  I will need to rework it.
>>
>>  -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>
>>
>>
>>  On Mon, 19 Sep 2022 9:29am, Reed Dier wrote:
>>
>>>  I’m not sure if this might be helpful, but my logrotate.d for slurm looks
>>>  a bit different: instead of a systemctl reload, I am sending a specific
>>>  SIGUSR2 signal, which is supposedly intended for log rotation in
>>>  Slurm.
>>>
>>>>      postrotate
>>>>              pkill -x --signal SIGUSR2 slurmctld
>>>>              pkill -x --signal SIGUSR2 slurmd
>>>>              pkill -x --signal SIGUSR2 slurmdbd
>>>>              exit 0
>>>>      endscript
>>>
>>>  I would take a look here: https://slurm.schedmd.com/slurm.conf.html#lbAQ
>>>
>>>  Reed
>>>
>>>>  On Sep 19, 2022, at 7:46 AM, Paul Raines <raines at nmr.mgh.harvard.edu>
>>>>  wrote:
>>>> 
>>>>
>>>>  I have had two nights where, right at 3:35am, a bunch of jobs were
>>>>  killed early with TIMEOUT, well before their normal TimeLimit.
>>>>  The slurmctld log has lots of lines like this at 3:35am:
>>>>
>>>>  [2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached
>>>>  for JobId=1636922
>>>>
>>>>  with jobs running on several different nodes.
>>>>
>>>>  The one curious thing is right about this time log rotation is happening
>>>>  in cron on the slurmctld master node
>>>>
>>>>  Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily) starting
>>>>  logrotate
>>>>  Sep 12 03:34:59 mlsc-head run-parts[1719028]: (/etc/cron.daily) finished
>>>>  logrotate
>>>>
>>>>  The 5 minute runtime here is a big anomaly.  On other machines, like
>>>>  nodes just running slurmd or my web servers, this only takes a couple of
>>>>  seconds.
>>>>
>>>>  In /etc/logrotate.d/slurmctl I have
>>>>
>>>>    postrotate
>>>>      systemctl reload slurmdbd >/dev/null 2>/dev/null || true
>>>>      /bin/sleep 1
>>>>      systemctl reload slurmctld >/dev/null 2>/dev/null || true
>>>>    endscript
>>>>
>>>>  Does it make sense that this could be causing the issue?
>>>>
>>>>  In slurm.conf I had InactiveLimit=60, which I guess is what is being
>>>>  triggered, but my reading of the docs on this setting was that it only
>>>>  affects the starting of a job with srun/salloc and not a job that has
>>>>  been running for days.  Is it InactiveLimit that leads to the
>>>>  "inactivity time limit reached" message?
>>>>
>>>>  Anyway, I have changed it to InactiveLimit=600 to see if that helps.
>
>
>
>

