[slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

Mon Apr 1 14:55:09 UTC 2019

We’ve run into exactly the same problem, i.e. an extremely long upgrade process to the 17.11.x major release. Luckily, we’ve found a solution.

The first approach was to tune various InnoDB options, such as increasing the buffer pool size (8G), the log file size (64M), or the lock wait timeout (900), but that didn’t really help. Even extreme values like 40G for the buffer pool size on a 500G machine didn’t speed things up.
The conversion process still hung in the "Sending data" stage of the aforementioned query, which updates the tres_alloc attribute using a join of the job and step tables. We let the process run on the 500G machine for more than 90 hours before we had to kill it.
Keep in mind that our Slurm database contained about 11 million jobs in the job table and 13 million job steps in the step table. This may seem like a lot, but MySQL should be able to handle such numbers easily.

In order to finish the Slurm upgrade within the allotted maintenance window, we had to purge all jobs older than a few months via the slurmdbd purge options. Although this worked fairly quickly, it didn’t solve the problem of having to convert 11 million jobs.

Further analysis of the query showed that the MySQL optimizer had chosen the wrong execution plan. This may depend on the MySQL version; ours was 5.1.69.
Apparently, the "right table" of the join operation wasn’t created first, which resulted in a massive performance loss.
The solution was straightforward: extract the creation of the "right table" from the query and perform it first by creating a temporary table.
The result: the job table conversion was over in 17 minutes. The entire database upgrade operation was done in 43 minutes.
The jobs were re-inserted into the main database afterwards.
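
For readers without access to the attachment, here is a rough sketch of the general idea, not the patch itself. It uses Python's sqlite3 module instead of MySQL so that it runs anywhere, and the schema (a job table with job_db_inx and tres_alloc, a step table with an energy column) and the ",energy=" suffix are assumptions made up for illustration; the real Slurm tables and the real conversion query differ. The point is only the order of operations: build the aggregated "right table" once as a temporary table, then update the job table from it.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a hypothetical job/step schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE job (job_db_inx INTEGER PRIMARY KEY, tres_alloc TEXT);
        CREATE TABLE step (job_db_inx INTEGER, energy INTEGER);
        INSERT INTO job VALUES (1, '1=4'), (2, '1=4'), (3, '1=4');
        INSERT INTO step VALUES (1, 10), (1, 5), (2, 7);
        """
    )
    return conn


def convert_jobs(conn):
    """Update job.tres_alloc from per-job step totals via a temporary table."""
    with conn:  # commit on success, roll back on error
        # Step 1: materialize the "right table" of the join exactly once.
        conn.execute(
            "CREATE TEMPORARY TABLE step_energy "
            "(job_db_inx INTEGER PRIMARY KEY, total INTEGER)"
        )
        conn.execute(
            "INSERT INTO step_energy "
            "SELECT job_db_inx, SUM(energy) FROM step GROUP BY job_db_inx"
        )
        # Step 2: update the job table from the small, keyed temporary table.
        conn.execute(
            """
            UPDATE job
            SET tres_alloc = tres_alloc || ',energy=' || (
                SELECT total FROM step_energy s
                WHERE s.job_db_inx = job.job_db_inx
            )
            WHERE job_db_inx IN (SELECT job_db_inx FROM step_energy)
            """
        )
        # Step 3: the temporary table is no longer needed.
        conn.execute("DROP TABLE step_energy")


if __name__ == "__main__":
    db = build_toy_db()
    convert_jobs(db)
    for row in db.execute("SELECT job_db_inx, tres_alloc FROM job ORDER BY 1"):
        print(row)
```

Jobs without any steps (job 3 in the toy data) are left untouched. Whether a given MySQL version needs this kind of manual materialization depends on its optimizer, so it is worth checking the plan with EXPLAIN before and after.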

In case this helps, the patch is attached.

Kind regards,
Lech

-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurm-17.11.13-2-job_table_convert.patch
Type: application/octet-stream
Size: 1815 bytes
Desc: not available
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20190401/47028fe5/attachment-0001.obj>
-------------- next part --------------

> On 27.02.2018 at 20:51, Chris Samuel <chris at csamuel.org> wrote:
> 
> On Wednesday, 28 February 2018 2:13:41 AM AEDT Miguel Gila wrote:
> 
>> Microcode patches were not applied to the physical system, only the kernel
>> was upgraded, so I'm not sure whether the performance hit could come from
>> that or not.
> 
> Yes it would, it's the kernel changes that cause the impact.  My understanding
> is that the microcode update had features that were intended to mitigate that.
> 
> Also note Intel later withdrew the microcode update due to instability on
> earlier CPUs (Linux distros reverted their firmware updates at that time):
> 
> https://newsroom.intel.com/news/root-cause-of-reboot-issue-identified-updated-guidance-for-customers-and-partners/
> 
> and it appears the most recent update is intended to be pushed out via firmware
> updates rather than a microcode file loaded from the OS.
> 
> https://newsroom.intel.com/news/latest-intel-security-news-updated-firmware-available/
> 
> Best of luck!
> Chris
> -- 
> Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC