[slurm-users] slurmdbd purge not working

Lech Nieroda lech.nieroda at uni-koeln.de
Mon Apr 8 16:21:05 UTC 2019


Hello Julien,

The InnoDB engine may fail to start if you change parameters such as innodb_log_file_size without rebuilding the database, because the configured values no longer match what InnoDB finds on disk. Try restoring the old parameters.
To debug the archive dump error, you might want to run slurmdbd manually in the foreground with a higher debug level, e.g. "slurmdbd -D -vvvv". That should show in much greater detail what happens on the server side when you run the sacctmgr command.
Common causes are missing write permissions on the specified archive directory (e.g. for the "slurm" user), no space left on the storage device, or various timeouts. Have you checked whether any files are actually written to the directory? Were any jobs purged?
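
The permission and disk-space checks above could be sketched as a small shell function. This is only an illustration: the function name check_archive_dir is made up here, and it tests write access for whichever user runs it, so it would need to be run as the user slurmdbd runs as (SlurmUser=slurm in the quoted config).

```shell
# Report whether the current user can write to an archive directory
# and how much space is left on its filesystem.
# check_archive_dir is a hypothetical helper, not part of Slurm.
check_archive_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "not writable by $(id -un): $dir"
        return 1
    fi
    echo "writable by $(id -un): $dir"
    df -h "$dir"    # free space on the underlying filesystem
    ls -l "$dir"    # any archive files written so far?
}

# Usage, with the directory from the quoted slurmdbd.conf:
#   check_archive_dir /home/joule/archives
```
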
You can check the number of stored jobs directly in mysql with "select count(*) from <clustername>_job_table;".
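
A minimal sketch of running that count from the shell. The helper name count_jobs_sql and the cluster name "mycluster" are placeholders; the database and user names are taken from the quoted slurmdbd.conf (StorageLoc, StorageUser). The function only prints the query, so it can be inspected before it is passed to mysql.

```shell
# Print the SQL that counts job records for one cluster.
# Assumes the cluster name contains only letters, digits and underscores.
count_jobs_sql() {
    case "$1" in
        ''|*[!A-Za-z0-9_]*)
            echo "invalid cluster name: $1" >&2
            return 1
            ;;
    esac
    printf 'SELECT COUNT(*) FROM %s_job_table;\n' "$1"
}

# Usage ("mycluster" is a placeholder for your cluster name):
#   mysql -u slurm -p slurm_acct_db -e "$(count_jobs_sql mycluster)"
```

Running it before and after a purge attempt shows whether any job records were actually removed.
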

If everything else fails, you can still purge the job and step records manually by using mysql. The slurmdbd service would need to be shut down for that.
I haven’t checked slurm 15.x, but in 17.x and 18.x there are three tables involved when dealing with jobs directly:
<clustername>_job_table (with the primary key job_db_inx)
<clustername>_step_table (references job_db_inx)
<clustername>_suspend_table (references job_db_inx)

Deleting only from the job_table would create inconsistencies, so don't do that. Instead, write a delete query that removes all records with a job_db_inx smaller than a chosen value, and run it on all three tables.
Before doing anything, it’s advisable to perform a database backup with mysqldump.
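
The manual purge described above might look roughly like the sketch below. It is untested against a real Slurm database: the helper name purge_sql, the cluster name "mycluster", the cutoff value and the file names are all placeholders, and the table layout is the one listed above for 17.x and 18.x. The function only prints the DELETE statements, so they can be reviewed before anything is run.

```shell
# Print DELETE statements that remove all job, step and suspend records
# whose job_db_inx is below a cutoff. Nothing is executed here.
purge_sql() {
    cluster=$1
    cutoff=$2
    case "$cluster" in
        ''|*[!A-Za-z0-9_]*)
            echo "invalid cluster name" >&2
            return 1
            ;;
    esac
    case "$cutoff" in
        ''|*[!0-9]*)
            echo "cutoff must be a non-negative integer" >&2
            return 1
            ;;
    esac
    # Dependent tables first, the job table last.
    for table in step suspend job; do
        printf 'DELETE FROM %s_%s_table WHERE job_db_inx < %s;\n' \
            "$cluster" "$table" "$cutoff"
    done
}

# Usage, with slurmdbd stopped (all names and the cutoff are placeholders):
#   mysqldump -u slurm -p slurm_acct_db > slurm_acct_db.backup.sql
#   purge_sql mycluster 1000000 > purge.sql    # review this file first
#   mysql -u slurm -p slurm_acct_db < purge.sql
```

On a very large database it may be gentler to raise the cutoff in several steps rather than delete everything in one pass.
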

Kind regards,
Lech



> On 08.04.2019 at 16:31, Julien Rey <julien.rey at univ-paris-diderot.fr> wrote:
> 
> On top of that, the mysql daemon hasn't started since this afternoon, and I get errors like these:
> 
> ...
> 190408 16:22:10 [ERROR] Plugin 'InnoDB' init function returned error.
> 190408 16:22:10 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
> 190408 16:22:10 [ERROR] Unknown/unsupported storage engine: InnoDB
> 190408 16:22:10 [ERROR] Aborting
> 
> 190408 16:22:10 [Note] /usr/sbin/mysqld: Shutdown complete
> 
> 190408 16:22:10 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
> 
> 
> Should I just drop the database altogether and upgrade slurm? What would be the procedure for re-creating a database from scratch?
> 
> 
> J.
> 
> On 05/04/2019 16:43, Ole Holm Nielsen wrote:
>> Hi Julien,
>> 
>> Did you optimize the MySQL database, in particular InnoDB?
>> 
>> I have collected some documentation in my Wiki page https://wiki.fysik.dtu.dk/niflheim/Slurm_database#mysql-configuration
>> and I also discuss database purging.
>> 
>> Please note that we run Slurm 17.11 (and recently 18.08) on CentOS 7.6 systems which come with a MariaDB 5.5 database.  We have no problems with the database or the daily purging operations (see https://wiki.fysik.dtu.dk/niflheim/Slurm_database#setting-database-purge-parameters).
>> 
>> If you are upgrading your Slurm version (or planning to), I also recommend reading the thread [slurm-users] "Extreme long db upgrade 16.05.6 -> 17.11.3" from the last few days.
>> 
>> Best regards,
>> Ole
>> 
>> On 4/5/19 4:28 PM, Julien Rey wrote:
>>> The failure occurs after a few minutes (~10).
>>> 
>>> And we are running out of space on the slurm controller. The mysql daemon is at 100% CPU usage all the time. This issue is becoming critical.
>>> 
>>> On 05/04/2019 16:10, Paul Edmon wrote:
>>>> Did it just time out, or did the failure happen immediately?  If it was immediate, you may be hitting a bug. It "should" be safe to upgrade to a later version of 15.08.*. There may be fixes in there related to that.  I would look at the changelog though, just to see if any database work was done.
>>>> 
>>>> -Paul Edmon-
>>>> 
>>>> On 4/5/19 9:05 AM, Julien Rey wrote:
>>>>> Hi Paul, thanks for your advice. Actually I already tried what you suggested. No matter what value I put after PurgeJobAfter, I always end up with the same error:
>>>>> 
>>>>> sacctmgr archive dump Directory=/home/joule/archives/ PurgeJobAfter=1days
>>>>> sacctmgr: error: slurmdbd: Getting response to message type 1459
>>>>> sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
>>>>> Problem dumping archive: Unspecified error
>>>>> 
>>>>> sacctmgr archive dump Directory=/home/joule/archives/ PurgeJobAfter=48months
>>>>> sacctmgr: error: slurmdbd: Getting response to message type 1459
>>>>> sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
>>>>> Problem dumping archive: Unspecified error
>>>>> 
>>>>> Has anyone tried to truncate tables by hand directly in the mysql command line?
>>>>> 
>>>>> On 04/04/2019 16:13, Paul Edmon wrote:
>>>>>> We ran into this problem in the past.  I know that fixes were put in to deal with large purges as a result of our problems but I don't recall what version they ended up in, likely newer than 15.08.0.
>>>>>> 
>>>>>> A solution that can work is to step the purge cutoff forward in time, so that instead of one large purge you do several smaller ones. That at least worked for us in the past.
>>>>>> 
>>>>>> -Paul Edmon-
>>>>>> 
>>>>>> On 4/4/19 9:38 AM, Julien Rey wrote:
>>>>>>> Hello,
>>>>>>> 
>>>>>>> Our slurm accounting database keeps growing (it is now more than 100 GB) and is never purged. We are running slurm 15.08.0-0pre1. I would like to upgrade to a more recent version of slurmdbd, but I fear that the database update may break everything.
>>>>>>> 
>>>>>>> Here is our slurmdbd.conf:
>>>>>>> 
>>>>>>> AuthType=auth/munge
>>>>>>> AuthInfo=/var/run/munge/munge.socket.2
>>>>>>> DbdHost=localhost
>>>>>>> DebugLevel=6
>>>>>>> StorageHost=localhost
>>>>>>> StorageLoc=slurm_acct_db
>>>>>>> StoragePass=shazaam
>>>>>>> StorageType=accounting_storage/mysql
>>>>>>> StorageUser=slurm
>>>>>>> LogFile=/var/log/slurm-llnl/slurmdbd.log
>>>>>>> PidFile=/var/run/slurm-llnl/slurmdbd.pid
>>>>>>> SlurmUser=slurm
>>>>>>> ArchiveDir=/home/joule/archives
>>>>>>> PurgeEventAfter=18
>>>>>>> PurgeJobAfter=18
>>>>>>> PurgeResvAfter=1
>>>>>>> PurgeStepAfter=1
>>>>>>> PurgeSuspendAfter=1
>>>>>>> 
>>>>>>> I tried to purge it manually using this command but the slurmdbd daemon ends up crashing and it doesn't remove anything:
>>>>>>> 
>>>>>>> sacctmgr archive dump Directory=/home/joule/archives/ PurgeJobAfter=365days
>>>>>>> 
>>>>>>> sacctmgr: error: slurmdbd: Getting response to message type 1459
>>>>>>> sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
>>>>>>> Problem dumping archive: Unspecified error
>>>>>>> 
>>>>>>> Sometimes I have to restart the mysql daemon (we are running mysql 5.5.39-1). The /var/log/slurm-llnl/slurmdbd.log shows nothing. The mysql logs are empty.
>>>>>>> 
>>>>>>> I tried increasing these values in my.cnf, but so far without success:
>>>>>>> 
>>>>>>> innodb_buffer_pool_size        = 32G
>>>>>>> innodb_lock_wait_timeout    = 3600
>>>>>>> 
>>>>>>> Is there any way to solve this issue? Otherwise, what would be the procedure for deleting the database records altogether and starting with a fresh one?
>> 
> 
> 
> -- 
> Julien REY
> 
> Plate-forme RPBS
> Modélisation Computationnelle des Interactions Protéines-Ligand (CMPLI)
> Université Paris Diderot - Paris VII
> tel : 01 57 27 83 95
> 
> 



