[slurm-users] slurmdbd purge not working
Julien Rey
julien.rey at univ-paris-diderot.fr
Mon Apr 8 14:31:06 UTC 2019
Hi Ole,
Thank you for your advice.
As I said in my previous messages, this is how I set up my.cnf:
innodb_buffer_pool_size = 32G
innodb_log_file_size = 64M
innodb_lock_wait_timeout = 3600
I have read the thread "Extreme long db upgrade 16.05.6 -> 17.11.3".
However I have no way to purge the database with the slurmdbd tool. And
I still don't know if I can do it by hand with the mysql command line.
On top of that, the mysql daemon has not been starting since this
afternoon, and I get errors like these:
190408 16:22:08 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
190408 16:22:08 [Warning] Using unique option prefix key_buffer instead of key_buffer_size is deprecated and will be removed in a future release. Please use the full name instead.
190408 16:22:08 [Warning] Using unique option prefix myisam-recover instead of myisam-recover-options is deprecated and will be removed in a future release. Please use the full name instead.
190408 16:22:08 [Note] Plugin 'FEDERATED' is disabled.
190408 16:22:08 InnoDB: The InnoDB memory heap is disabled
190408 16:22:08 InnoDB: Mutexes and rw_locks use GCC atomic builtins
190408 16:22:08 InnoDB: Compressed tables use zlib 1.2.8
190408 16:22:08 InnoDB: Using Linux native AIO
190408 16:22:08 InnoDB: Initializing buffer pool, size = 32.0G
190408 16:22:09 InnoDB: Completed initialization of buffer pool
190408 16:22:10 InnoDB: highest supported file format is Barracuda.
InnoDB: No valid checkpoint found.
InnoDB: If this error appears when you are creating an InnoDB database,
InnoDB: the problem may be that during an earlier attempt you managed
InnoDB: to create the InnoDB data files, but log file creation failed.
InnoDB: If that is the case, please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/error-creating-innodb.html
190408 16:22:10 [ERROR] Plugin 'InnoDB' init function returned error.
190408 16:22:10 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
190408 16:22:10 [ERROR] Unknown/unsupported storage engine: InnoDB
190408 16:22:10 [ERROR] Aborting
190408 16:22:10 [Note] /usr/sbin/mysqld: Shutdown complete
190408 16:22:10 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
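
One check that may help narrow this down, offered as an assumption rather than a diagnosis: MySQL 5.5 expects the existing ib_logfile* files in the data directory to have exactly the size given by innodb_log_file_size. The sketch below compares the two. The function names are hypothetical; /var/lib/mysql and 64M come from the log and the my.cnf lines above. The log above reports "No valid checkpoint found" rather than a size mismatch, so this check may well come back clean.

```python
import os
import re

# Multipliers for the size suffixes accepted in my.cnf values.
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(value):
    """Convert a my.cnf size such as '64M' or '32G' to a byte count."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?)\s*", value.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def mismatched_log_files(datadir, configured_size):
    """Return {filename: bytes_on_disk} for every ib_logfile* in datadir
    whose size differs from the configured innodb_log_file_size."""
    expected = parse_size(configured_size)
    mismatches = {}
    for name in sorted(os.listdir(datadir)):
        if name.startswith("ib_logfile"):
            actual = os.path.getsize(os.path.join(datadir, name))
            if actual != expected:
                mismatches[name] = actual
    return mismatches
```

Calling mismatched_log_files("/var/lib/mysql", "64M") as a user who can read the data directory returns an empty dict when the sizes agree. The sketch only reads file sizes and changes nothing on disk.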
Should I just drop the database altogether and upgrade Slurm? What
would be the procedure for re-creating a database from scratch?
J.
On 05/04/2019 16:43, Ole Holm Nielsen wrote:
> Hi Julien,
>
> Did you optimize the MySQL database, in particular InnoDB?
>
> I have collected some documentation in my Wiki page
> https://wiki.fysik.dtu.dk/niflheim/Slurm_database#mysql-configuration
> and I also discuss database purging.
>
> Please note that we run Slurm 17.11 (and recently 18.08) on CentOS 7.6
> systems which come with a MariaDB 5.5 database. We have no problems
> with the database or the daily purging operations (see
> https://wiki.fysik.dtu.dk/niflheim/Slurm_database#setting-database-purge-parameters).
>
> If you are upgrading your Slurm version (or planning to do so), I also
> recommend that you read the thread [slurm-users] "Extreme long db
> upgrade 16.05.6 -> 17.11.3" from the last few days.
>
> Best regards,
> Ole
>
> On 4/5/19 4:28 PM, Julien Rey wrote:
>> The failure occurs after a few minutes (~10).
>>
>> And we are running out of space on the slurm controller. The mysql
>> daemon is at 100% CPU usage all the time. This issue is becoming
>> critical.
>>
>> On 05/04/2019 16:10, Paul Edmon wrote:
>>> Did it just time out, or did that failure happen immediately? If
>>> immediately, you may be in a situation where you are hitting a bug. It
>>> "should" be safe to upgrade to a later version of 15.08.*. There may
>>> be fixes in there related to that. I would look at the changelog
>>> though just to see if there is any database work that was done.
>>>
>>> -Paul Edmon-
>>>
>>> On 4/5/19 9:05 AM, Julien Rey wrote:
>>>> Hi Paul, thanks for your advice. Actually I already tried what you
>>>> suggested. No matter what value I put after PurgeJobAfter, I
>>>> always end up with the same error:
>>>>
>>>> sacctmgr archive dump Directory=/home/joule/archives/ PurgeJobAfter=1days
>>>> sacctmgr: error: slurmdbd: Getting response to message type 1459
>>>> sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
>>>> Problem dumping archive: Unspecified error
>>>>
>>>> sacctmgr archive dump Directory=/home/joule/archives/ PurgeJobAfter=48months
>>>> sacctmgr: error: slurmdbd: Getting response to message type 1459
>>>> sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
>>>> Problem dumping archive: Unspecified error
>>>>
>>>> Has anyone tried to truncate tables by hand directly in the mysql
>>>> command line?
>>>>
>>>> On 04/04/2019 16:13, Paul Edmon wrote:
>>>>> We ran into this problem in the past. I know that fixes were put
>>>>> in to deal with large purges as a result of our problems but I
>>>>> don't recall what version they ended up in, likely newer than
>>>>> 15.08.0.
>>>>>
>>>>> A solution that can work is to walk up the time so that instead of
>>>>> one large purge you do several smaller purges. That at least
>>>>> worked for us in the past.
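
The stepwise approach described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the 48-month starting window, and the 12-month step are hypothetical, while the command form is the one used elsewhere in this thread. The sketch only builds the command strings, oldest window first, and does not run them.

```python
def purge_commands(start_months, target_months, step_months, archive_dir):
    """Build one 'sacctmgr archive dump' command per purge window.

    The windows shrink from start_months down to target_months in steps
    of step_months, so each run only has to remove a slice of old jobs.
    """
    if step_months <= 0:
        raise ValueError("step_months must be positive")
    if start_months < target_months:
        raise ValueError("start_months must be >= target_months")

    commands = []
    window = start_months
    while True:
        commands.append(
            f"sacctmgr archive dump Directory={archive_dir} "
            f"PurgeJobAfter={window}months"
        )
        if window == target_months:
            break
        # Never step past the retention period that is actually wanted.
        window = max(window - step_months, target_months)
    return commands


# Hypothetical example: walk from 48 months down to the 18 months
# configured in slurmdbd.conf, 12 months at a time.
for command in purge_commands(48, 18, 12, "/home/joule/archives/"):
    print(command)
```

Each command would be run only after the previous one has completed, so that a failure shows which window is the problematic one.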
>>>>>
>>>>> -Paul Edmon-
>>>>>
>>>>> On 4/4/19 9:38 AM, Julien Rey wrote:
>>>>>> Hello,
>>>>>>
>>>>>> Our slurm accounting database is growing bigger and bigger (more
>>>>>> than 100 GB) and is never being purged. We are running slurm
>>>>>> 15.08.0-0pre1. I would like to upgrade to a more recent version
>>>>>> of the slurmdbd, but my fear is that it may break everything
>>>>>> during the update of the database.
>>>>>>
>>>>>> Here is our slurmdbd.conf:
>>>>>>
>>>>>> AuthType=auth/munge
>>>>>> AuthInfo=/var/run/munge/munge.socket.2
>>>>>> DbdHost=localhost
>>>>>> DebugLevel=6
>>>>>> StorageHost=localhost
>>>>>> StorageLoc=slurm_acct_db
>>>>>> StoragePass=shazaam
>>>>>> StorageType=accounting_storage/mysql
>>>>>> StorageUser=slurm
>>>>>> LogFile=/var/log/slurm-llnl/slurmdbd.log
>>>>>> PidFile=/var/run/slurm-llnl/slurmdbd.pid
>>>>>> SlurmUser=slurm
>>>>>> ArchiveDir=/home/joule/archives
>>>>>> PurgeEventAfter=18
>>>>>> PurgeJobAfter=18
>>>>>> PurgeResvAfter=1
>>>>>> PurgeStepAfter=1
>>>>>> PurgeSuspendAfter=1
>>>>>>
>>>>>> I tried to purge it manually using this command but the slurmdbd
>>>>>> daemon ends up crashing and it doesn't remove anything:
>>>>>>
>>>>>> sacctmgr archive dump Directory=/home/joule/archives/ PurgeJobAfter=365days
>>>>>>
>>>>>> sacctmgr: error: slurmdbd: Getting response to message type 1459
>>>>>> sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
>>>>>> Problem dumping archive: Unspecified error
>>>>>>
>>>>>> Sometimes I have to restart the mysql daemon (we are running
>>>>>> mysql 5.5.39-1). The /var/log/slurm-llnl/slurmdbd.log shows
>>>>>> nothing. The mysql logs are empty.
>>>>>>
>>>>>> I tried to increase these values in my.cnf but so far no success:
>>>>>>
>>>>>> innodb_buffer_pool_size = 32G
>>>>>> innodb_lock_wait_timeout = 3600
>>>>>>
>>>>>> Is there any way to solve this issue? Otherwise, what would be
>>>>>> the procedure for deleting the database records altogether and
>>>>>> starting over with a fresh one?
>
--
Julien REY
Plate-forme RPBS
Modélisation Computationnelle des Interactions Protéines-Ligand (CMPLI)
Université Paris Diderot - Paris VII
tel : 01 57 27 83 95