[slurm-users] job_time_limit: inactivity time limit reached ...

Paul Raines raines at nmr.mgh.harvard.edu
Tue Sep 20 13:17:39 UTC 2022


Further investigation found that I had set up logrotate to handle a mysql
dump

   mysqldump -R --single-transaction -B slurm_db | bzip2

which is what is taking 5 minutes.  I think this is most likely locking
tables for the duration of the dump, which hangs calls to slurmdbd and
causes the issue.  I will need to rework it.
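
For reference, the numbers can be checked against the cron log lines quoted
below. This is only a rough sketch, assuming GNU date; elapsed_seconds is a
made-up helper name, and the year is taken from the slurmctld log line.
Whether InactiveLimit is really what triggers the message is still a guess.

```shell
#!/bin/bash
# Elapsed seconds between two timestamps (GNU date assumed).
# elapsed_seconds is a hypothetical helper name, not part of Slurm.
elapsed_seconds() {
    local start end
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    echo $(( end - start ))
}

# Start/finish times of logrotate from the cron log lines quoted below
# (year assumed from the slurmctld log entry).
stall=$(elapsed_seconds "2022-09-12 03:30:02" "2022-09-12 03:34:59")
echo "logrotate ran for ${stall}s"

# Compare against the old and new InactiveLimit values from slurm.conf.
for limit in 60 600; do
    if (( stall > limit )); then
        echo "InactiveLimit=${limit}: shorter than the logrotate run"
    else
        echo "InactiveLimit=${limit}: longer than the logrotate run"
    fi
done
```

So the logrotate run lasted well beyond the old limit of 60 but stays under
the new 600, which would fit the jobs being killed right as it finished.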

-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Mon, 19 Sep 2022 9:29am, Reed Dier wrote:

> I’m not sure if this might be helpful, but my logrotate.d for slurm looks a bit different: instead of a systemctl reload, I am sending a SIGUSR2 signal, which is supposedly meant specifically for log rotation in slurm.
>
>>     postrotate
>>             pkill -x --signal SIGUSR2 slurmctld
>>             pkill -x --signal SIGUSR2 slurmd
>>             pkill -x --signal SIGUSR2 slurmdbd
>>             exit 0
>>     endscript
>
> I would take a look here: https://slurm.schedmd.com/slurm.conf.html#lbAQ
>
> Reed
>
>> On Sep 19, 2022, at 7:46 AM, Paul Raines <raines at nmr.mgh.harvard.edu> wrote:
>>
>>
>> I have had two nights where right at 3:35am a bunch of jobs were
>> killed early with TIMEOUT, way before their normal TimeLimit.
>> The slurmctld log has lots of lines like this at 3:35am:
>>
>> [2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached for JobId=1636922
>>
>> with jobs running on several different nodes.
>>
>> The one curious thing is right about this time log rotation is happening
>> in cron on the slurmctld master node
>>
>> Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily) starting logrotate
>> Sep 12 03:34:59 mlsc-head run-parts[1719028]: (/etc/cron.daily) finished logrotate
>>
>> The 5 minute runtime here is a big anomaly.  On other machines, like
>> nodes just running slurmd or my web servers, this only takes a couple of seconds.
>>
>> In /etc/logrotate.d/slurmctl I have
>>
>>   postrotate
>>     systemctl reload slurmdbd >/dev/null 2>/dev/null || true
>>     /bin/sleep 1
>>     systemctl reload slurmctld >/dev/null 2>/dev/null || true
>>   endscript
>>
>> Does it make sense that this could be causing the issue?
>>
>> In slurm.conf I had InactiveLimit=60, which I guess is what is being hit,
>> but my reading of the docs on this setting was that it only affects the
>> starting of a job with srun/salloc and not a job that has been running
>> for days.  Is it InactiveLimit that leads to the "inactivity time limit reached" message?
>>
>> Anyway, I have changed to InactiveLimit=600 to see if that helps.
>>
>>
>> ---------------------------------------------------------------
>> Paul Raines                     http://help.nmr.mgh.harvard.edu
>> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
>> 149 (2301) 13th Street     Charlestown, MA 02129	    USA
>>
>>
>>
>> The information in this e-mail is intended only for the person to whom it is addressed.  If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Mass General Brigham Compliance HelpLine at https://www.massgeneralbrigham.org/complianceline <https://www.massgeneralbrigham.org/complianceline> .
>> Please note that this e-mail is not secure (encrypted).  If you do not wish to continue communication over unencrypted e-mail, please notify the sender of this message immediately.  Continuing to send or respond to e-mail after receiving this message means you understand and accept this risk and wish to continue to communicate over unencrypted e-mail.
>>
>
>

