[slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

Jessica Nettelblad jessica.nettelblad at gmail.com
Thu Feb 22 03:32:45 MST 2018


We experienced the same problem. On our two new clusters with smaller
databases (<1 million jobs), the upgrades from 17.02.9 to 17.11.2 and
17.11.3 were quick and smooth. On the third, older cluster, which has a
larger database (>30 million jobs), the upgrade was a mess in both MySQL
and MariaDB. It got stuck on that exact query (energy consumption in one
job table). I tried some tricks to get around it, only to get stuck on
other queries instead.
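
For a rough sense of scale, here is a back-of-the-envelope sketch. It
assumes conversion time grows linearly with the number of job records,
which nobody in this thread has verified, and it borrows Kurt's figure
from the quoted message below (almost sixteen hours for 1.5 million
jobs) as the reference point. The function name and defaults are
illustrative only.

```python
# Rough estimate only: assumes conversion time scales linearly with the
# number of job records, which is an assumption, not a measured fact.

def estimate_conversion_hours(jobs, ref_jobs=1_500_000, ref_hours=16.0):
    """Scale a reference conversion time linearly to another job count."""
    jobs_per_hour = ref_jobs / ref_hours
    return jobs / jobs_per_hour


if __name__ == "__main__":
    # 30 million jobs is the lower bound for our older cluster.
    hours = estimate_conversion_hours(30_000_000)
    print(f"{hours:.0f} hours (~{hours / 24:.1f} days)")
```

Under that assumption, a 30-million-job database would need on the
order of 320 hours, far beyond a normal maintenance window. Different
hardware and table contents could change the number considerably.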

I spent some time on it without figuring out exactly why the conversion
kept getting stuck. Then I decided to install 17.11 with a fresh database
and add the necessary information to it.

Basically, all our policy information is regularly imported from an
external infrastructure, so we could rerun those scripts to recreate the
data. Keeping our historical accounting data "hot" in the database was also
not needed, although it has been convenient at times -- hence we had not
been actively purging it before. All things considered, I decided not to
dig deeper into the conversion issue.
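
For reference, purging is configured in slurmdbd.conf. A minimal sketch
follows; the retention periods and the archive path are illustrative
assumptions, not settings we have run:

```
# slurmdbd.conf (excerpt); all values below are illustrative assumptions
PurgeJobAfter=24months
PurgeStepAfter=24months
PurgeEventAfter=12months
PurgeSuspendAfter=12months
# Write purged records to archive files instead of discarding them
ArchiveJobs=yes
ArchiveSteps=yes
# Hypothetical path; choose a directory slurmdbd can write to
ArchiveDir=/var/spool/slurm/archive
```

With purging enabled, the job tables stay bounded, which should make
future schema conversions touch far fewer rows.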

We're very happy with the performance of 17.11 now that it's up and
running. They've cleaned up a bunch of unnecessary locks that have caused
bottlenecks for us in the past. Good luck with the conversion!


On Thu, Feb 22, 2018 at 1:30 AM, Christopher Benjamin Coffey <
Chris.Coffey at nau.edu> wrote:

> This is great to know, Kurt. We can't be the only folks running into this.
> I wonder if the MySQL update code gets into a deadlock or something. I'm
> hoping a slurm dev will chime in ...
>
> Kurt, out of band if need be, I'd be interested in the details of what you
> ended up doing.
>
> Best,
> Chris
>
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>
>
> On 2/21/18, 5:08 PM, "slurm-users on behalf of Kurt H Maier" <
> slurm-users-bounces at lists.schedmd.com on behalf of khm at sciops.net> wrote:
>
>     On Wed, Feb 21, 2018 at 11:56:38PM +0000, Christopher Benjamin Coffey wrote:
>     > Hello,
>     >
>     > We have been trying to upgrade slurm on our cluster from 16.05.6 to
>     > 17.11.3. I'm thinking this should be doable? Past upgrades have been a
>     > breeze, and I believe during the last one, the db upgrade took like 25
>     > minutes. Well now, the db upgrade process is taking far too long. We
>     > previously attempted the upgrade during a maintenance window and the
>     > upgrade process did not complete after 24 hrs. I gave up on the upgrade
>     > and reverted the slurm version back by restoring a backup db.
>
>     We hit this on our try as well: upgrading from 17.02.9 to 17.11.3.  We
>     truncated our job history for the upgrade, and then did the rest of the
>     conversion out-of-band and re-imported it after the fact.  It took us
>     almost sixteen hours to convert a 1.5 million-job store.
>
>     We got hung up on precisely the same query you did, on a similarly
>     hefty machine.  It caused us to roll back an upgrade and try again
>     during our subsequent maintenance window with the above approach.
>
>     khm
>
>
>
>