[slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

Christopher Benjamin Coffey Chris.Coffey at nau.edu
Wed Feb 21 16:56:38 MST 2018


Hello,

We have been trying to upgrade Slurm on our cluster from 16.05.6 to 17.11.3. I'm assuming that jump should be doable? Past upgrades have been a breeze, and I believe the db upgrade during the last one took about 25 minutes. This time, the db upgrade is taking far too long. We first attempted the upgrade during a maintenance window, and it had not completed after 24 hrs. I gave up and reverted to the previous Slurm version by restoring a backup of the db.

Since the failed attempt, I've archived a bunch of jobs, as we had 4 years of jobs in the database. We now keep only the last 1.5 years' worth, which reduced our db size from 3.7GB to 1.1GB. We are now archiving jobs regularly through Slurm.

I've finally had time to look at this a bit more, and we've restored the reduced database onto another system to test the upgrade in a dev environment, hoping to prove that the slimmed-down db will upgrade within a reasonable amount of time. Yet the current upgrade on this dev system has already taken 20 hrs. The database has 1.8M jobs. That doesn't seem like that many jobs!

The conversion is stuck on this command:

update "monsoon_job_table" as job
  left outer join (
    select job_db_inx, SUM(consumed_energy) 'sum_energy'
    from "monsoon_step_table"
    where id_step >= 0 and consumed_energy != 18446744073709551614
    group by job_db_inx
  ) step on job.job_db_inx=step.job_db_inx
  set job.tres_alloc=concat(job.tres_alloc, concat(',3=',
    case when step.sum_energy then step.sum_energy
         else 18446744073709551614 END))
  where job.tres_alloc != '' && job.tres_alloc not like '%,3=%'
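
For anyone reading along, here is a rough Python sketch of what that statement appears to do, working on in-memory data instead of the real tables. The function names, the NO_VAL64 constant name, and the sample rows are all made up for illustration; only the column names and the sentinel value come from the statement above. It is a reading of the SQL, not how slurmdbd implements the conversion.

```python
# Sentinel value copied from the statement above; the name is an assumption.
NO_VAL64 = 18446744073709551614


def energy_by_job(steps):
    """Mirror the subquery: sum consumed_energy per job_db_inx,
    skipping steps with id_step < 0 or the sentinel energy value."""
    totals = {}
    for job_db_inx, id_step, consumed_energy in steps:
        if id_step >= 0 and consumed_energy != NO_VAL64:
            totals[job_db_inx] = totals.get(job_db_inx, 0) + consumed_energy
    return totals


def convert_tres_alloc(jobs, steps):
    """Mirror the update: append ',3=<energy>' to every non-empty
    tres_alloc that does not already contain a ',3=' entry."""
    totals = energy_by_job(steps)
    converted = {}
    for job_db_inx, tres_alloc in jobs.items():
        if tres_alloc != "" and ",3=" not in tres_alloc:
            energy = totals.get(job_db_inx)  # None plays the role of NULL
            # "case when step.sum_energy": NULL and 0 both take the ELSE branch
            value = energy if energy else NO_VAL64
            tres_alloc = f"{tres_alloc},3={value}"
        converted[job_db_inx] = tres_alloc
    return converted


# Hypothetical sample rows: job_db_inx -> tres_alloc
jobs = {1: "1=4,2=1000", 2: "1=2,2=500", 3: "", 4: "1=1,3=77"}
# Hypothetical step rows: (job_db_inx, id_step, consumed_energy)
steps = [(1, 0, 100), (1, 1, 50), (1, -2, 999), (2, 0, NO_VAL64)]

result = convert_tres_alloc(jobs, steps)
for job_db_inx, tres_alloc in sorted(result.items()):
    print(job_db_inx, repr(tres_alloc))
```

In this toy data, job 1 gets the summed energy of its two non-negative steps, job 2 gets the sentinel because its only step carries the sentinel value, and jobs 3 and 4 are left alone by the where clause.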

The system is no slouch:

28 core, E5-2680 v4 2.4GHz
SSD
128GB memory

Has anyone else run into this issue? Does anyone have a suggestion? This seems like a ridiculous amount of time for the upgrade! The database is healthy as far as I can see: no errors in the slurmdbd log, etc.

Let me know if you need more info!

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 


