[slurm-users] slurmdbd purge not working
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Fri Apr 5 14:43:08 UTC 2019
Hi Julien,
Did you optimize the MySQL database, in particular InnoDB?
I have collected some documentation on my Wiki page
https://wiki.fysik.dtu.dk/niflheim/Slurm_database#mysql-configuration
where I also discuss database purging.
Please note that we run Slurm 17.11 (and recently 18.08) on CentOS 7.6
systems which come with a MariaDB 5.5 database. We have no problems
with the database or the daily purging operations (see
https://wiki.fysik.dtu.dk/niflheim/Slurm_database#setting-database-purge-parameters).
If you are upgrading your Slurm version (or planning to do so), I also
recommend that you read the thread [slurm-users] "Extreme long db upgrade
16.05.6 -> 17.11.3" from the last few days.
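As a rough illustration of the stepwise purge that Paul describes below,
a sketch like the following prints one archive command per step, starting
with the oldest data. The function name purge_steps and the 48/18/6-month
values are only assumptions for the example; the sacctmgr syntax is copied
from Julien's own command. It is a dry run that only echoes the commands.

```shell
#!/bin/sh
# Dry run: print one archive/purge command per step, oldest data first.
# Arguments: start, stop and step (all in months), then the archive directory.
# purge_steps and the values used below are hypothetical, for illustration only.
purge_steps() {
    start=$1; stop=$2; step=$3; dir=$4
    m=$start
    while [ "$m" -ge "$stop" ]; do
        # Each command keeps the last $m months and purges anything older.
        echo "sacctmgr archive dump Directory=$dir PurgeJobAfter=${m}months"
        m=$((m - step))
    done
}

# Walk from 48 months down to 18 months in 6-month steps.
purge_steps 48 18 6 /home/joule/archives/
```

Each printed command could then be run by hand, waiting for it to finish
before starting the next one. If the step does not divide the range evenly,
the loop stops before reaching the final value, so a last run with the
target value would still be needed.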
Best regards,
Ole
On 4/5/19 4:28 PM, Julien Rey wrote:
> The failure occurs after a few minutes (~10).
>
> And we are running out of space on the slurm controller. The mysql
> daemon is at 100% CPU usage all the time. This issue is becoming critical.
>
> On 05/04/2019 16:10, Paul Edmon wrote:
>> Did it just time out, or did that failure happen immediately? If it
>> was immediate, you may be in a situation where you are hitting a bug. It
>> "should" be safe to upgrade to a later version of 15.08.*. There may
>> be fixes in there related to that. I would look at the changelog
>> though just to see if there is any database work that was done.
>>
>> -Paul Edmon-
>>
>> On 4/5/19 9:05 AM, Julien Rey wrote:
>>> Hi Paul, thanks for your advice. Actually, I already tried what you
>>> suggested. No matter what value I put after PurgeJobAfter, I always
>>> end up with the same error:
>>>
>>> sacctmgr archive dump Directory=/home/joule/archives/ PurgeJobAfter=1days
>>> sacctmgr: error: slurmdbd: Getting response to message type 1459
>>> sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
>>> Problem dumping archive: Unspecified error
>>>
>>> sacctmgr archive dump Directory=/home/joule/archives/ PurgeJobAfter=48months
>>> sacctmgr: error: slurmdbd: Getting response to message type 1459
>>> sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
>>> Problem dumping archive: Unspecified error
>>>
>>> Has anyone tried truncating the tables by hand, directly from the mysql
>>> command line?
>>>
>>> On 04/04/2019 16:13, Paul Edmon wrote:
>>>> We ran into this problem in the past. I know that fixes were put in
>>>> to deal with large purges as a result of our problems but I don't
>>>> recall what version they ended up in, likely newer than 15.08.0.
>>>>
>>>> A solution that can work is to walk up the time so that instead of
>>>> one large purge you do several smaller purges. That at least worked
>>>> for us in the past.
>>>>
>>>> -Paul Edmon-
>>>>
>>>> On 4/4/19 9:38 AM, Julien Rey wrote:
>>>>> Hello,
>>>>>
>>>>> Our slurm accounting database keeps growing (it is now more
>>>>> than 100 GB) and is never being purged. We are running slurm
>>>>> 15.08.0-0pre1. I would like to upgrade to a more recent version of
>>>>> slurmdbd, but my fear is that it may break everything during
>>>>> the update of the database.
>>>>>
>>>>> Here is our slurmdbd.conf:
>>>>>
>>>>> AuthType=auth/munge
>>>>> AuthInfo=/var/run/munge/munge.socket.2
>>>>> DbdHost=localhost
>>>>> DebugLevel=6
>>>>> StorageHost=localhost
>>>>> StorageLoc=slurm_acct_db
>>>>> StoragePass=shazaam
>>>>> StorageType=accounting_storage/mysql
>>>>> StorageUser=slurm
>>>>> LogFile=/var/log/slurm-llnl/slurmdbd.log
>>>>> PidFile=/var/run/slurm-llnl/slurmdbd.pid
>>>>> SlurmUser=slurm
>>>>> ArchiveDir=/home/joule/archives
>>>>> PurgeEventAfter=18
>>>>> PurgeJobAfter=18
>>>>> PurgeResvAfter=1
>>>>> PurgeStepAfter=1
>>>>> PurgeSuspendAfter=1
>>>>>
>>>>> I tried to purge it manually using this command, but the slurmdbd
>>>>> daemon ends up crashing and nothing is removed:
>>>>>
>>>>> sacctmgr archive dump Directory=/home/joule/archives/ PurgeJobAfter=365days
>>>>>
>>>>> sacctmgr: error: slurmdbd: Getting response to message type 1459
>>>>> sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error
>>>>> Problem dumping archive: Unspecified error
>>>>>
>>>>> Sometimes I have to restart the mysql daemon (we are running mysql
>>>>> 5.5.39-1). The /var/log/slurm-llnl/slurmdbd.log shows nothing. The
>>>>> mysql logs are empty.
>>>>>
>>>>> I tried increasing these values in my.cnf, but so far without success:
>>>>>
>>>>> innodb_buffer_pool_size = 32G
>>>>> innodb_lock_wait_timeout = 3600
>>>>>
>>>>> Is there any way to solve this issue? Otherwise, what would be the
>>>>> procedure for deleting the database records altogether and starting
>>>>> over with a fresh database?