[slurm-users] slurmdbd purge not working

Mon Apr 8 14:31:06 UTC 2019

Hi Ole,

Thank you for your advice.

As I said in my previous messages, this is how I set the my.cnf:

innodb_buffer_pool_size         = 32G
innodb_log_file_size            = 64M
innodb_lock_wait_timeout        = 3600

I have read the thread "Extreme long db upgrade 16.05.6 -> 17.11.3". 
However, I still cannot purge the database through slurmdbd, and I 
still don't know whether I can do it by hand from the mysql command line.

On top of that, the mysql daemon has refused to start since this 
afternoon, and I get errors like these:

190408 16:22:08 mysqld_safe Starting mysqld daemon with databases from 
/var/lib/mysql
190408 16:22:08 [Warning] Using unique option prefix key_buffer instead 
of key_buffer_size is deprecated and will be removed in a future 
release. Please use the full name instead.
190408 16:22:08 [Warning] Using unique option prefix myisam-recover 
instead of myisam-recover-options is deprecated and will be removed in a 
future release. Please use the full name instead.
190408 16:22:08 [Note] Plugin 'FEDERATED' is disabled.
190408 16:22:08 InnoDB: The InnoDB memory heap is disabled
190408 16:22:08 InnoDB: Mutexes and rw_locks use GCC atomic builtins
190408 16:22:08 InnoDB: Compressed tables use zlib 1.2.8
190408 16:22:08 InnoDB: Using Linux native AIO
190408 16:22:08 InnoDB: Initializing buffer pool, size = 32.0G
190408 16:22:09 InnoDB: Completed initialization of buffer pool
190408 16:22:10 InnoDB: highest supported file format is Barracuda.
InnoDB: No valid checkpoint found.
InnoDB: If this error appears when you are creating an InnoDB database,
InnoDB: the problem may be that during an earlier attempt you managed
InnoDB: to create the InnoDB data files, but log file creation failed.
InnoDB: If that is the case, please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/error-creating-innodb.html
190408 16:22:10 [ERROR] Plugin 'InnoDB' init function returned error.
190408 16:22:10 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE 
failed.
190408 16:22:10 [ERROR] Unknown/unsupported storage engine: InnoDB
190408 16:22:10 [ERROR] Aborting

190408 16:22:10 [Note] /usr/sbin/mysqld: Shutdown complete

190408 16:22:10 mysqld_safe mysqld from pid file 
/var/run/mysqld/mysqld.pid ended
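
Since innodb_log_file_size appears in the my.cnf above, one thing that may be worth checking is whether the existing redo log files still match it. As far as I know, MySQL 5.5 does not resize the ib_logfile* files automatically when that setting changes. The sketch below is only a read-only check, in Python: it compares the size of each ib_logfile* in the data directory from the log (/var/lib/mysql) with the configured 64M. The helper names are made up for illustration, and it is an assumption, not a confirmed diagnosis, that a size mismatch or a truncated log file is behind "No valid checkpoint found" here.

```python
import glob
import os


def parse_size(text):
    """Convert a my.cnf-style size such as '64M' or '32G' to bytes."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = text.strip().upper()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def mismatched_log_files(datadir, configured_size):
    """Return (path, size) for each ib_logfile* whose size differs.

    Hypothetical helper: it only reads file sizes and changes nothing.
    """
    expected = parse_size(configured_size)
    mismatches = []
    for path in sorted(glob.glob(os.path.join(datadir, "ib_logfile*"))):
        size = os.path.getsize(path)
        if size != expected:
            mismatches.append((path, size))
    return mismatches


if __name__ == "__main__":
    # Data directory taken from the mysqld_safe log, size from my.cnf.
    expected = parse_size("64M")
    for path, size in mismatched_log_files("/var/lib/mysql", "64M"):
        print(f"{path}: {size} bytes on disk, expected {expected}")
```

Reading /var/lib/mysql will usually need root or the mysql user. An empty result only means the file sizes agree with the setting; it says nothing about whether the log contents are intact.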

Should I just drop the database altogether and upgrade slurm? What 
would be the procedure for re-creating a database from scratch?

J.

On 05/04/2019 16:43, Ole Holm Nielsen wrote:
> Hi Julien,
>
> Did you optimize the MySQL database, in particular InnoDB?
>
> I have collected some documentation in my Wiki page 
> https://wiki.fysik.dtu.dk/niflheim/Slurm_database#mysql-configuration
> and I also discuss database purging.
>
> Please note that we run Slurm 17.11 (and recently 18.08) on CentOS 7.6 
> systems which come with a MariaDB 5.5 database.  We have no problems 
> with the database or the daily purging operations (see 
> https://wiki.fysik.dtu.dk/niflheim/Slurm_database#setting-database-purge-parameters).
>
> If you are upgrading your Slurm version (or planning to do it), I also 
> recommend that you read the thread [slurm-users] "Extreme long db 
> upgrade 16.05.6 -> 17.11.3" from the last few days.
>
> Best regards,
> Ole
>
> On 4/5/19 4:28 PM, Julien Rey wrote:
>> The failure occurs after a few minutes (~10).
>>
>> And we are running out of space on the slurm controller. The mysql 
>> daemon is at 100% CPU usage all the time. This issue is becoming 
>> critical.
>>
>> On 05/04/2019 16:10, Paul Edmon wrote:
>>> Did it just time out, or did that failure happen immediately?  If 
>>> immediate, you may be in a situation where you are hitting a bug. It 
>>> "should" be safe to upgrade to a later version of 15.08.*. There may 
>>> be fixes in there related to that.  I would look at the changelog 
>>> though just to see if there is any database work that was done.
>>>
>>> -Paul Edmon-
>>>
>>> On 4/5/19 9:05 AM, Julien Rey wrote:
>>>> Hi Paul, thanks for your advice. Actually I already tried what you 
>>>> suggested. No matter what value I put after PurgeJobAfter, I 
>>>> always end up with the same error:
>>>>
>>>> sacctmgr archive dump Directory=/home/joule/archives/ 
>>>> PurgeJobAfter=1days
>>>> sacctmgr: error: slurmdbd: Getting response to message type 1459
>>>> sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
>>>>  Problem dumping archive: Unspecified error
>>>>
>>>> sacctmgr archive dump Directory=/home/joule/archives/ 
>>>> PurgeJobAfter=48months
>>>> sacctmgr: error: slurmdbd: Getting response to message type 1459
>>>> sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
>>>>  Problem dumping archive: Unspecified error
>>>>
>>>> Has anyone tried to truncate tables by hand directly in the mysql 
>>>> command line?
>>>>
>>>> On 04/04/2019 16:13, Paul Edmon wrote:
>>>>> We ran into this problem in the past.  I know that fixes were put 
>>>>> in to deal with large purges as a result of our problems but I 
>>>>> don't recall what version they ended up in, likely newer than 
>>>>> 15.08.0.
>>>>>
>>>>> A solution that can work is to walk up the time so that instead of 
>>>>> one large purge you do several smaller purges. That at least 
>>>>> worked for us in the past.
>>>>>
>>>>> -Paul Edmon-
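
A minimal sketch of the stepped purge Paul describes above, in Python. It only builds and prints the sacctmgr command lines, starting from a generous retention and walking down to the target, so each run has a smaller slice of old records to remove. The command form and archive directory are copied from the messages in this thread. The function name, the 60 and 18 month bounds, and the 6 month step are illustrative assumptions, and nothing here has been tested against slurmdbd 15.08.0-0pre1.

```python
def stepped_purge_commands(start_months, target_months, step_months,
                           archive_dir="/home/joule/archives/"):
    """Return sacctmgr command lines that lower PurgeJobAfter step by step.

    Hypothetical helper: the retention starts at start_months and shrinks
    by step_months per command until it reaches target_months.
    """
    if step_months <= 0:
        raise ValueError("step_months must be positive")
    if target_months > start_months:
        raise ValueError("target_months must not exceed start_months")

    commands = []
    months = start_months
    while months > target_months:
        commands.append(
            f"sacctmgr archive dump Directory={archive_dir} "
            f"PurgeJobAfter={months}months"
        )
        months -= step_months
    # Always finish exactly on the target retention.
    commands.append(
        f"sacctmgr archive dump Directory={archive_dir} "
        f"PurgeJobAfter={target_months}months"
    )
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for command in stepped_purge_commands(60, 18, 6):
        print(command)
```

Printing rather than executing is deliberate: each command can then be run by hand, checking that slurmdbd and mysqld are still healthy before moving to the next, shorter retention.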
>>>>>
>>>>> On 4/4/19 9:38 AM, Julien Rey wrote:
>>>>>> Hello,
>>>>>>
>>>>>> Our slurm accounting database keeps growing (more than 100 GB) 
>>>>>> and is never purged. We are running slurm 
>>>>>> 15.08.0-0pre1. I would like to upgrade to a more recent version 
>>>>>> of the slurmdbd, but my fear is that it may break everything 
>>>>>> during the update of the database.
>>>>>>
>>>>>> Here is our slurmdbd.conf:
>>>>>>
>>>>>> AuthType=auth/munge
>>>>>> AuthInfo=/var/run/munge/munge.socket.2
>>>>>> DbdHost=localhost
>>>>>> DebugLevel=6
>>>>>> StorageHost=localhost
>>>>>> StorageLoc=slurm_acct_db
>>>>>> StoragePass=shazaam
>>>>>> StorageType=accounting_storage/mysql
>>>>>> StorageUser=slurm
>>>>>> LogFile=/var/log/slurm-llnl/slurmdbd.log
>>>>>> PidFile=/var/run/slurm-llnl/slurmdbd.pid
>>>>>> SlurmUser=slurm
>>>>>> ArchiveDir=/home/joule/archives
>>>>>> PurgeEventAfter=18
>>>>>> PurgeJobAfter=18
>>>>>> PurgeResvAfter=1
>>>>>> PurgeStepAfter=1
>>>>>> PurgeSuspendAfter=1
>>>>>>
>>>>>> I tried to purge it manually using this command but the slurmdbd 
>>>>>> daemon ends up crashing and it doesn't remove anything:
>>>>>>
>>>>>> sacctmgr archive dump Directory=/home/joule/archives/ 
>>>>>> PurgeJobAfter=365days
>>>>>>
>>>>>> sacctmgr: error: slurmdbd: Getting response to message type 1459
>>>>>> sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
>>>>>>  Problem dumping archive: Unspecified error
>>>>>>
>>>>>> Sometimes I have to restart the mysql daemon (we are running 
>>>>>> mysql 5.5.39-1). The /var/log/slurm-llnl/slurmdbd.log shows 
>>>>>> nothing. The mysql logs are empty.
>>>>>>
>>>>>> I tried to increase these values in my.cnf, but so far without success:
>>>>>>
>>>>>> innodb_buffer_pool_size        = 32G
>>>>>> innodb_lock_wait_timeout    = 3600
>>>>>>
>>>>>> Is there any way to solve this issue? Otherwise, what would be 
>>>>>> the procedure for deleting the database records altogether and 
>>>>>> starting with a fresh one?
>

-- 
Julien REY

Plate-forme RPBS
Modélisation Computationnelle des Interactions Protéines-Ligand (CMPLI)
Université Paris Diderot - Paris VII
tel : 01 57 27 83 95