[slurm-users] Database Tuning w/SLURM

Fri Mar 22 20:13:45 UTC 2019

> On Mar 22, 2019, at 4:22 AM, Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> wrote:
> 
> On 3/21/19 6:56 PM, Ryan Novosielski wrote:
>>> On Mar 21, 2019, at 12:21 PM, Loris Bennett <loris.bennett at fu-berlin.de> wrote:
>>> 
>>>  Our last cluster only hit around 2.5 million jobs after
>>> around 6 years, so database conversion was never an issue.  For sites
>>> with higher throughput, things may be different, but I would hope that
>>> at those places, the managers would know the importance of planned
>>> updates and testing.
>> I’d be curious about any database tuning you, or anyone else here, might have done. SchedMD’s guidance is minimal.
>> I’ve never been impressed with the performance on ours, and I’ve also seen other sites reporting >24 hour database conversion times.
> 
> Database tuning is actually documented by SchedMD, but you have to find the appropriate pages first ;-)

Yeah, I’ve seen it, but there’s very little information provided (similar to what you’ve got listed). The major difference in theirs is the further mention of “you might want to increase innodb_buffer_pool_size quite a bit more than 1024MB.” In my conversations with SchedMD I more or less asked, “Is that it? If it’s still slow, does that mean look somewhere else, or keep tweaking?” There is also other advice in SchedMD bugs (the one you mention on your site included), but much of it is for dramatically different versions of MySQL or SlurmDBD, and it’s not always easy to tell what still applies.

It also depends on the type of access, the size of the DB, etc., but I only have the one DB size to go on. Presumably the community collectively knows how much is required for a given kind of workload, or how many years of a given job volume can be kept before you start to have problems with most tuning settings. I have taken some advice from mysqltuner.pl in some cases too, though I’m using basically the SchedMD recommendations right now (the thread_cache_size one was mine; I can’t recall where I found it, but it seemed like a good idea for our workload):

[root@squid ~]# cat /etc/my.cnf.d/slurmdbd.cnf
[mysqld]
innodb_buffer_pool_size=1G
thread_cache_size=4
innodb_log_file_size = 64M
innodb_lock_wait_timeout = 900
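
For anyone wanting to compare settings like these against the memory on their DB host, here is a minimal Python sketch. It only parses a my.cnf-style fragment and converts MySQL size suffixes (K, M, G) to bytes. The helper names `to_bytes` and `parse_mysqld_section` are made up for illustration and are not part of any MySQL or Slurm tooling; the fragment is just a copy of the file above.

```python
import configparser

# Copy of the fragment shown above; adjust to match your own file.
FRAGMENT = """
[mysqld]
innodb_buffer_pool_size=1G
thread_cache_size=4
innodb_log_file_size = 64M
innodb_lock_wait_timeout = 900
"""

# MySQL size suffixes are powers of 1024.
_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3}


def to_bytes(value):
    """Convert a MySQL-style size such as '64M' or '1G' to bytes (hypothetical helper)."""
    value = value.strip().upper()
    if value and value[-1] in _SUFFIXES:
        return int(value[:-1]) * _SUFFIXES[value[-1]]
    return int(value)


def parse_mysqld_section(text):
    """Return the [mysqld] options of a my.cnf-style fragment as a dict (hypothetical helper)."""
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(text)
    return dict(parser["mysqld"])


settings = parse_mysqld_section(FRAGMENT)
pool = to_bytes(settings["innodb_buffer_pool_size"])
log = to_bytes(settings["innodb_log_file_size"])
print(f"buffer pool: {pool // 1024**2} MiB, redo log file: {log // 1024**2} MiB")
```

With the values above this reports a 1024 MiB buffer pool and a 64 MiB log file, which makes it easy to see how small they are relative to the RAM on a typical database server.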

> I have collected Slurm database information in my Wiki page https://wiki.fysik.dtu.dk/niflheim/Slurm_database.  You may want to focus on these sections:
> 
> * MySQL configuration (Innodb configuration)
> 
> * Setting database purge parameters (prune unwanted old database entries)
> 
> * Backup and restore of database (hopefully everyone does this already)
> 
> * Upgrade of MySQL/MariaDB (MySQL versions)
> 
> * Migrate the slurmdbd service to another server (I decided to do that recently)
> 
> I hope this sheds some light on what needs to be considered.

Thanks, it’s helpful to have more information, particularly on purging and the migration process (which doesn’t seem complicated, but it’s nice to simply rip off the steps as opposed to having to write them :-D).

The tug-of-war on our system comes from SlurmDBD often needing quite a bit of memory itself for certain operations, and it sits on the MySQL server. I sometimes wonder whether it might not be better to colocate SlurmDBD with slurmctld, separating them both from the MySQL server.
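
To make that trade-off concrete, a rough back-of-the-envelope sketch in Python follows. Every number in it is a placeholder I picked for illustration (host RAM, how much SlurmDBD needs at peak, what to leave for the OS); none of it is a measurement from our system or a SchedMD recommendation, and `buffer_pool_budget` is a hypothetical helper.

```python
def buffer_pool_budget(total_gib, slurmdbd_reserve_gib, os_reserve_gib=2.0):
    """GiB left over for innodb_buffer_pool_size after the given reservations."""
    remaining = total_gib - slurmdbd_reserve_gib - os_reserve_gib
    # Never report a negative budget if the reservations exceed the host RAM.
    return max(remaining, 0.0)


# Placeholder numbers, not measurements from any real system.
shared = buffer_pool_budget(total_gib=16, slurmdbd_reserve_gib=6)
dedicated = buffer_pool_budget(total_gib=16, slurmdbd_reserve_gib=0)

print(f"SlurmDBD on the DB host: {shared:.0f} GiB available for the buffer pool")
print(f"SlurmDBD elsewhere:      {dedicated:.0f} GiB available for the buffer pool")
```

The point is only that whatever SlurmDBD needs at its peak comes straight out of what can be given to InnoDB on a shared host, which is why moving SlurmDBD next to slurmctld looks attractive.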

PS: mainly for Prentice, Ole’s site has the thread from this list that mentioned the very large DB upgrade time:
https://lists.schedmd.com/pipermail/slurm-users/2018-February/000612.html — we tested the DB upgrade first independently because of that risk.

--
____
|| \\UTGERS,  	 |---------------------------*O*---------------------------
||_// the State	 |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ	 | Office of Advanced Research Computing - MSB C630, Newark
     `'