<div dir="ltr">
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">TL;DR: If you get a lock wait timeout from the Slurm database, and a higher innodb_lock_wait_timeout doesn't help, you might want to consider loosening the auto-increment lock mode (innodb_autoinc_lock_mode) in MariaDB.</span></p>
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">The long story! <br></span></p><p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">We’ve just upgraded our main cluster to 17.11.3 and moved our database to MariaDB. There have been some glitches, and this one falls into the category where it’s not an actual bug, but our experience might still be interesting to anyone who runs sacctmgr delete and finds slurmdbd crashing. After I changed the MariaDB configuration, it worked again, and I didn't try to reproduce the issue or test it further. Here is what I saw while fixing the problem for us.</span></p>
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">THE ERROR<span></span></span></p>
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">Slurmdbd repeatedly died with the error message “fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error.” Setting innodb_lock_wait_timeout in my.cnf to a higher value didn’t solve the problem.</span></p>
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">A single command from a script seemed to be all that was needed to create this lock situation: sacctmgr -i delete account where account=$accountname cluster=$cluster_name</span></p>
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">THE PROBLEM<span></span></span></p>
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">A delete by sacctmgr is followed by an ALTER TABLE in the same transaction; see the MySQL accounting storage plugin source: <a href="https://github.com/SchedMD/slurm/blob/master/src/plugins/accounting_storage/mysql/accounting_storage_mysql.c" style="color:rgb(5,99,193);text-decoration:underline">https://github.com/SchedMD/slurm/blob/master/src/plugins/accounting_storage/mysql/accounting_storage_mysql.c</a></span></p>
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">This seems to be problematic with a fairly standard MariaDB configuration on CentOS 7. The query seems to create a lock conflict with itself: “Waiting for table metadata lock | alter table "milou_assoc_table" AUTO_INCREMENT=0”</span></p>
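<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">The quoted state looks like a row from the server's process list. Assuming that is where it came from, a statement like the following, run from a mysql client with sufficient privileges while the delete is hanging, should show which statements are waiting:</span></p>
<pre>
-- List all connections with their current state and full statement text.
SHOW FULL PROCESSLIST;
</pre>
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">A row whose State column reads “Waiting for table metadata lock” is waiting for a metadata lock on that table to be released.</span></p>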
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">THE FIX<span></span></span></p>
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">The Slurm code already postpones the ALTER TABLE call until the end of the transaction, noting that a rollback won’t be possible afterwards. Mixing DDL and DML SQL statements in the same transaction, for the same table, might not be wise. <span></span></span></p>
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">A quicker solution, which I opted for in the middle of a service stop with our systems down, was to change the MariaDB configuration: I set innodb_autoinc_lock_mode=2 instead of 1, which allows looser locking.</span></p>
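<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">As a minimal sketch of that change, assuming the server reads its options from the [mysqld] section of my.cnf (the file location and any other settings are site-specific and not shown here):</span></p>
<pre>
[mysqld]
# 1 is the "consecutive" mode that was in use before the change;
# 2 is the "interleaved" mode, which takes looser auto-increment locks.
innodb_autoinc_lock_mode = 2
</pre>
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">This variable is not dynamic, so MariaDB has to be restarted for the new value to take effect. Afterwards, SHOW VARIABLES LIKE 'innodb_autoinc_lock_mode'; can be used to check which value the server is actually running with.</span></p>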
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">OUR SETUP<span></span></span></p>
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">We are running Slurm 17.11.3 on a 300-node CentOS 7 cluster with MariaDB 5.5.56-2. We have all old and new users in our LDAP, and information on project expiration in a separate external structure. Only projects that are active (not expired), and users belonging to at least one such project, are listed in the Slurm database. At regular intervals, expired data is removed using sacctmgr delete.</span></p>
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">SOME COMMENTS<br>Since we moved the database to MariaDB and upgraded to 17.11 at the same time, I don’t know how MariaDB behaved with previous Slurm versions.</span></p>
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">We got this issue with delete, and changing this configuration fixed it. There might be problems with other queries too.<span></span></span></p>
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">Changing to a looser lock mode might introduce new issues, especially depending on what backup and recovery solutions you have planned for your database. I set innodb_autoinc_lock_mode=2, but it is possible that the “traditional” value of 0 will also work.<span></span></span></p>
<p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif"><span lang="EN-US">That’s it! It would be interesting to hear if someone else has encountered this problem and how you solved it.</span></p><p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif">Best regards, <br></p><p class="MsoNormal" style="margin:0cm 0cm 8pt;line-height:107%;font-size:11pt;font-family:"Calibri",sans-serif">Jessica Nettelblad, UPPMAX<br></p>
<br></div>