[slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

Roshan Thomas Mathew roshanthomasmathew at gmail.com
Wed Jul 18 02:56:06 MDT 2018


We ran into this issue trying to move from 16.05.3 -> 17.11.7 with 1.5M
records in the job table.

In our first attempt, MySQL reported "ERROR 1206 The total number of locks
exceeds the lock table size" after about 7 hours.

We increased the InnoDB buffer pool size to 12G (the machine hosting MySQL
has 128GB of memory), following
https://dba.stackexchange.com/questions/27328/how-large-should-be-mysql-innodb-buffer-pool-size
and restarted the conversion, which then completed successfully in 6.5 hours.
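
For anyone wanting to try the same change, here is a rough sketch of checking
and raising the buffer pool size from the mysql client. It is not the exact
procedure we ran. The 12G value is simply what worked for us. Resizing at
runtime is, as far as I know, only supported on MySQL 5.7.5+ and MariaDB
10.2.2+. On older servers, set innodb_buffer_pool_size under [mysqld] in
my.cnf and restart mysqld instead.

```sql
-- Show the current InnoDB buffer pool size, in bytes.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Raise it to 12 GiB (12 * 1024^3 bytes). This assumes a server version
-- that supports online resizing; otherwise set it in my.cnf and restart.
SET GLOBAL innodb_buffer_pool_size = 12884901888;

-- Confirm the new value before restarting the slurmdbd conversion.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
```

InnoDB keeps its lock table inside the buffer pool, which is why a larger pool
gets a single huge UPDATE past ERROR 1206.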

I am sure there are other MySQL tweaks tailored to SLURM that could be
applied; it would be useful to pool them together in the documentation.

Cheers,
Roshan



On Wed, 21 Feb 2018 at 23:59, Christopher Benjamin Coffey <
Chris.Coffey at nau.edu> wrote:

> Hello,
>
> We have been trying to upgrade slurm on our cluster from 16.05.6 to
> 17.11.3. I'm thinking this should be doable? Past upgrades have been a
> breeze, and I believe during the last one, the db upgrade took like 25
> minutes. Well now, the db upgrade process is taking far too long. We
> previously attempted the upgrade during a maintenance window and the
> upgrade process did not complete after 24 hrs. I gave up on the upgrade and
> reverted the slurm version back by restoring a backup db.
>
> Since the failed attempt at the upgrade, I've archived a bunch of jobs as
> we had 4 years of jobs in the database. Now only keeping last 1.5 years
> worth. This reduced our db size down from 3.7GB to 1.1GB. We are now
> archiving jobs regularly through slurm.
>
> I've finally had time to look at this a bit more and we've restored the
> reduced database onto another system to test the upgrade process in a dev
> environment, hoping to prove that the slimmed down db will upgrade within a
> reasonable amount of time. Yet, the current upgrade on this dev system has
> already taken 20 hrs. The database has 1.8M jobs. That doesn't seem like
> that many jobs!
>
> The conversion is stuck on this command:
>
> update "monsoon_job_table" as job left outer join ( select job_db_inx,
> SUM(consumed_energy) 'sum_energy' from "monsoon_step_table" where id_step
> >= 0 and consumed_energy != 18446744073709551614 group by job_db_inx ) step
> on job.job_db_inx=step.job_db_inx set job.tres_alloc=concat(job.tres_alloc,
> concat(',3=', case when step.sum_energy then step.sum_energy else
> 18446744073709551614 END)) where job.tres_alloc != '' && job.tres_alloc not
> like '%,3=%':
>
> The system is no slouch:
>
> 28 core, E5-2680 v4 2.4GHz
> SSD
> 128GB memory
>
> Anyone have this issue? Anyone have a suggestion? This seems like a
> ridiculous amount of time needed to perform the upgrade! The database is
> healthy as far as I see. No errors in the slurmdbd log, etc.
>
> Let me know if you need more info!
>
> Best,
> Chris
> 
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>
>
>