[slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

Malte Thoma Malte.Thoma at awi.de
Thu Feb 22 04:17:30 MST 2018


FYI:
* We aborted our upgrade from 17.02.1-2 to 17.11.2 after about 18 h.
* Emptied the job table ("truncate xyz_job_table;")
* Executed the seemingly never-ending SQL command by hand on a backup database
* Meanwhile we did the Slurm upgrade (fast and easy)
* Reset the First-Job-ID to a high number
* Inserted the converted job table into the real database again.
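
For anyone who wants to see the order of the steps above in one place, here is a rough sketch. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 in place of the real MySQL/MariaDB accounting database, the two columns (job_id, state) are made up and are not the Slurm schema, and convert() is a placeholder for the slow conversion statement, not the actual query. Only the table name xyz_job_table is taken from the list above.

```python
import sqlite3

TABLE = "xyz_job_table"  # name from the post; the columns below are invented


def make_db(rows=()):
    """Create an in-memory stand-in database holding a toy job table."""
    db = sqlite3.connect(":memory:")
    db.execute(f"CREATE TABLE {TABLE} (job_id INTEGER PRIMARY KEY, state TEXT)")
    db.executemany(f"INSERT INTO {TABLE} VALUES (?, ?)", rows)
    return db


def offload(live):
    """Copy the job history into a backup database, then empty the live table."""
    rows = live.execute(f"SELECT job_id, state FROM {TABLE}").fetchall()
    backup = make_db(rows)
    live.execute(f"DELETE FROM {TABLE}")  # SQLite stand-in for TRUNCATE
    return backup


def convert(backup):
    """Placeholder for the slow conversion, run away from the live system."""
    backup.execute(f"UPDATE {TABLE} SET state = upper(state)")


def reimport(live, backup):
    """Insert the converted history back next to jobs recorded in the meantime."""
    rows = backup.execute(f"SELECT job_id, state FROM {TABLE}").fetchall()
    live.executemany(f"INSERT INTO {TABLE} VALUES (?, ?)", rows)
    live.commit()


live = make_db([(1, "completed"), (2, "failed")])
backup = offload(live)

# The cluster is back in service: new jobs start at a high ID so they cannot
# collide with the old history that is still being converted.
live.execute(f"INSERT INTO {TABLE} VALUES (?, ?)", (1_000_000, "RUNNING"))

convert(backup)
reimport(live, backup)

final_rows = live.execute(
    f"SELECT job_id, state FROM {TABLE} ORDER BY job_id"
).fetchall()
```

The point of the high first job ID is visible in the last step: the re-import is a plain insert, and it only works without key conflicts because the jobs recorded during the conversion live in a separate ID range.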

It took two experts to do this, and we would very much appreciate a better upgrade concept!
In fact, we hesitate to upgrade from 17.11.2 to 17.11.3 because we are afraid of similar problems. Does anyone have experience with
this?

It would be good to know whether future upgrades are likely to cause the same problems, or whether this will improve.

Regards,
Malte






On 22.02.2018 at 01:30, Christopher Benjamin Coffey wrote:
> This is great to know Kurt. We can't be the only folks running into this.. I wonder if the mysql update code gets into a deadlock or something. I'm hoping a slurm dev will chime in ...
> 
> Kurt, out of band if need be, I'd be interested in the details of what you ended up doing.
> 
> Best,
> Chris
> 
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>   
> 
> On 2/21/18, 5:08 PM, "slurm-users on behalf of Kurt H Maier" <slurm-users-bounces at lists.schedmd.com on behalf of khm at sciops.net> wrote:
> 
>      On Wed, Feb 21, 2018 at 11:56:38PM +0000, Christopher Benjamin Coffey wrote:
>      > Hello,
>      >
>      > We have been trying to upgrade slurm on our cluster from 16.05.6 to 17.11.3. I'm thinking this should be doable? Past upgrades have been a breeze, and I believe during the last one, the db upgrade took like 25 minutes. Well now, the db upgrade process is taking far too long. We previously attempted the upgrade during a maintenance window and the upgrade process did not complete after 24 hrs. I gave up on the upgrade and reverted the slurm version back by restoring a backup db.
>      
>      We hit this on our try as well: upgrading from 17.02.9 to 17.11.3.  We
>      truncated our job history for the upgrade, and then did the rest of the
>      conversion out-of-band and re-imported it after the fact.  It took us
>      almost sixteen hours to convert a 1.5 million-job store.
>      
>      We got hung up on precisely the same query you did, on a similarly hefty
>      machine.  It caused us to roll back an upgrade and try again during our
>      subsequent maintenance window with the above approach.
>      
>      khm
>      
>      
> 

-- 
Malte Thoma        Tel. +49-471-4831-1828
HSM Documentation: https://goo.gl/R4drbb (User)
                    http://goo.gl/c4A5iE  (Admin)
HPC Documentation: https://goo.gl/o435rT (User)
                    https://goo.gl/GMssqe (Admin)
AWI, Geb.E (3125)
Am Handelshafen 12
27570 Bremerhaven
Tel. +49-471-4831-1828


