[slurm-users] slurmdbd purge not working

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Fri Apr 5 14:43:08 UTC 2019


Hi Julien,

Did you optimize the MySQL database, in particular InnoDB?

I have collected some documentation on my Wiki page 
https://wiki.fysik.dtu.dk/niflheim/Slurm_database#mysql-configuration
where I also discuss database purging.

Please note that we run Slurm 17.11 (and recently 18.08) on CentOS 7.6 
systems which come with a MariaDB 5.5 database.  We have no problems 
with the database or the daily purging operations (see 
https://wiki.fysik.dtu.dk/niflheim/Slurm_database#setting-database-purge-parameters).

If you are upgrading your Slurm version (or planning to), I also 
recommend that you read the thread [slurm-users] "Extreme long db upgrade 
16.05.6 -> 17.11.3" from the last few days.

Best regards,
Ole

On 4/5/19 4:28 PM, Julien Rey wrote:
> The failure occurs after a few minutes (~10).
> 
> And we are running out of space on the slurm controller. The mysql 
> daemon is at 100% CPU usage all the time. This issue is becoming critical.
> 
> On 05/04/2019 16:10, Paul Edmon wrote:
>> Did it just time out, or did the failure happen immediately?  If it 
>> was immediate, you may be hitting a bug. It "should" be safe to 
>> upgrade to a later version of 15.08.*, which may include related 
>> fixes.  I would check the changelog first, though, to see whether any 
>> database work was done.
>>
>> -Paul Edmon-
>>
>> On 4/5/19 9:05 AM, Julien Rey wrote:
>>> Hi Paul, thanks for your advice. Actually, I already tried what you 
>>> suggested. No matter what value I put after PurgeJobAfter, I always 
>>> end up with the same error:
>>>
>>> sacctmgr archive dump Directory=/home/joule/archives/ PurgeJobAfter=1days
>>> sacctmgr: error: slurmdbd: Getting response to message type 1459
>>> sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
>>>  Problem dumping archive: Unspecified error
>>>
>>> sacctmgr archive dump Directory=/home/joule/archives/ PurgeJobAfter=48months
>>> sacctmgr: error: slurmdbd: Getting response to message type 1459
>>> sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
>>>  Problem dumping archive: Unspecified error
>>>
>>> Has anyone tried truncating the tables by hand from the mysql 
>>> command line?
>>>
>>> On 04/04/2019 16:13, Paul Edmon wrote:
>>>> We ran into this problem in the past.  I know that fixes were put in 
>>>> to deal with large purges as a result of our problems but I don't 
>>>> recall what version they ended up in, likely newer than 15.08.0.
>>>>
>>>> A solution that can work is to walk the purge cutoff forward in 
>>>> steps, so that instead of one large purge you do several smaller 
>>>> purges. That at least worked for us in the past.
>>>>
>>>> -Paul Edmon-
>>>>
>>>> On 4/4/19 9:38 AM, Julien Rey wrote:
>>>>> Hello,
>>>>>
>>>>> Our Slurm accounting database keeps growing (more than 100 GB) 
>>>>> and is never being purged. We are running Slurm 15.08.0-0pre1. 
>>>>> I would like to upgrade to a more recent version of slurmdbd, 
>>>>> but I fear that the database update may break everything.
>>>>>
>>>>> Here is our slurmdbd.conf:
>>>>>
>>>>> AuthType=auth/munge
>>>>> AuthInfo=/var/run/munge/munge.socket.2
>>>>> DbdHost=localhost
>>>>> DebugLevel=6
>>>>> StorageHost=localhost
>>>>> StorageLoc=slurm_acct_db
>>>>> StoragePass=shazaam
>>>>> StorageType=accounting_storage/mysql
>>>>> StorageUser=slurm
>>>>> LogFile=/var/log/slurm-llnl/slurmdbd.log
>>>>> PidFile=/var/run/slurm-llnl/slurmdbd.pid
>>>>> SlurmUser=slurm
>>>>> ArchiveDir=/home/joule/archives
>>>>> PurgeEventAfter=18
>>>>> PurgeJobAfter=18
>>>>> PurgeResvAfter=1
>>>>> PurgeStepAfter=1
>>>>> PurgeSuspendAfter=1
>>>>>
>>>>> I tried to purge it manually with the command below, but the 
>>>>> slurmdbd daemon ends up crashing and nothing is removed:
>>>>>
>>>>> sacctmgr archive dump Directory=/home/joule/archives/ PurgeJobAfter=365days
>>>>>
>>>>> sacctmgr: error: slurmdbd: Getting response to message type 1459
>>>>> sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
>>>>>  Problem dumping archive: Unspecified error
>>>>>
>>>>> Sometimes I have to restart the mysql daemon (we are running mysql 
>>>>> 5.5.39-1). /var/log/slurm-llnl/slurmdbd.log shows nothing, and the 
>>>>> mysql logs are empty.
>>>>>
>>>>> I tried increasing these values in my.cnf, but so far without success:
>>>>>
>>>>> innodb_buffer_pool_size        = 32G
>>>>> innodb_lock_wait_timeout    = 3600
>>>>>
>>>>> Is there any way to solve this issue? Otherwise, what would be the 
>>>>> procedure for deleting the database records altogether and starting 
>>>>> over with a fresh database?
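
Paul Edmon's suggestion in this thread, several smaller purges instead of 
one large one, could be scripted roughly as follows. This is only a dry-run 
sketch in Python: it prints sacctmgr commands in the same form as the ones 
quoted in the thread and does not run them. The helper names (purge_steps, 
archive_dump_command) and the numbers (48, 18 and 6 months) are assumptions 
made for illustration, and the thread does not establish whether smaller 
steps avoid the DBD_ARCHIVE_DUMP failure on 15.08.0-0pre1.

```python
"""Dry-run sketch: print a series of progressively shorter purge commands."""


def purge_steps(oldest_months, target_months, step_months):
    """Return PurgeJobAfter values in months, from oldest down to target.

    All three arguments are hypothetical knobs for this sketch,
    not Slurm options.
    """
    if step_months <= 0:
        raise ValueError("step_months must be positive")
    if target_months > oldest_months:
        raise ValueError("target_months must not exceed oldest_months")
    steps = list(range(oldest_months, target_months, -step_months))
    steps.append(target_months)  # always finish exactly on the target
    return steps


def archive_dump_command(months, directory="/home/joule/archives/"):
    """Build the argv for one purge, in the form used in the thread."""
    return [
        "sacctmgr", "archive", "dump",
        f"Directory={directory}",
        f"PurgeJobAfter={months}months",
    ]


if __name__ == "__main__":
    # Assumed numbers: oldest records about 48 months old, keep 18 months,
    # shrink the retention window by 6 months per pass.
    for months in purge_steps(48, 18, 6):
        print(" ".join(archive_dump_command(months)))
```

Each printed command would be run one at a time, waiting for the previous 
purge to finish before starting the next.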


