[slurm-users] sacct end time for failed jobs

Paul Edmon pedmon at cfa.harvard.edu
Wed Mar 6 18:32:04 UTC 2019


Odds are the new version won't help with that.  You will have to do some 
MySQL work to fix it.
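
Before any manual database work, the precaution recommended in the quoted message below (stop slurmdbd, take a mysqldump) is worth scripting. The following is only a rough sketch: it prints the commands rather than running them, and it assumes slurmdbd is managed by systemd, that the accounting database is named slurm_acct_db as in the quoted session, and that the MySQL root account is used. The function name backup_plan and the dump file naming are made up for illustration.

```shell
#!/bin/sh
# Sketch only: print (do not execute) the steps for a pre-edit backup.
# Assumptions, not taken from the thread: slurmdbd is a systemd unit and
# the dump is written to the current directory.

# backup_plan [DATABASE] [STAMP]
#   DATABASE defaults to slurm_acct_db, STAMP to the current date and time.
backup_plan() {
    db=${1:-slurm_acct_db}
    stamp=${2:-$(date +%Y%m%d-%H%M%S)}
    echo "systemctl stop slurmdbd"
    echo "mysqldump -u root -p ${db} > ${db}-${stamp}.sql"
    echo "# ... make the manual changes in mysql, then:"
    echo "systemctl start slurmdbd"
}
```

Running `backup_plan slurm_acct_db 20190306` prints the four steps so they can be reviewed and pasted one at a time.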

-Paul Edmon-

On 3/6/2019 1:23 PM, Brian Andrus wrote:
>
> I am running the latest and did that, but it didn't change anything. 
> The jobs stay in the runaway state and no changes are made to the 
> database.
>
> Using 18.08.2-1.
>
> Maybe try updating to 19.05.0-0pre1?
>
> Brian
>
>
> On 3/6/2019 10:06 AM, Paul Edmon wrote:
>>
>> A lot of this is automated in the newer versions of Slurm.  You should 
>> just need to run:
>>
>> sacctmgr show runawayjobs
>>
>> It will then give you an option to clean them and Slurm will handle 
>> the rest.  If you add the -i option it will just clean them 
>> automatically.
>>
>> -Paul Edmon-
>>
>> On 3/6/2019 11:58 AM, Cyrus Proctor wrote:
>>>
>>> Hi Brian,
>>>
>>> Others probably have better suggestions before going the route I'm 
>>> about to detail. If you do go this route, be warned, you definitely 
>>> have the ability to irrevocably lose data or destroy your Slurm 
>>> accounting database. Do so at your own risk. I got here with 
>>> Google-fu after running out of other (known to me) options. Someone 
>>> please save Brian having to do what comes below ;-)
>>>
>>> Last warning: I'd recommend turning off slurmdbd and backing up the 
>>> database (mysqldump) before going forward.
>>>
>>> In my case, runaway jobs did not show up with `sacctmgr list 
>>> runawayjobs`. My problem was that I could not remove a user from the 
>>> Slurm database because it thought they still had active jobs. The 
>>> likely cause of this was the slurmdbd daemon not shutting down 
>>> gracefully at some point. The job was long gone but it was still in 
>>> a pending state:
>>>
>>> # sacct -j 899139
>>>         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
>>> ------------ ---------- ---------- ---------- ---------- ---------- --------
>>> 899139            equil   gpu-long    p-1234         20    PENDING      0:0
>>> # scontrol show job 899139
>>> slurm_load_jobs error: Invalid job id specified
>>> # mysql -u root -p
>>> ...
>>> Welcome to the MySQL monitor.  Commands end with ; or \g.
>>> Your MySQL connection id is 7453
>>> Server version: 5.1.73 Source distribution
>>>
>>> Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
>>>
>>> Oracle is a registered trademark of Oracle Corporation and/or its
>>> affiliates. Other names may be trademarks of their respective
>>> owners.
>>>
>>> Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
>>>
>>> mysql> use slurm_acct_db;
>>> Reading table information for completion of table and column names
>>> You can turn off this feature to get a quicker startup with -A
>>>
>>> Database changed
>>> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
>>> +-------+----------+------------+-------------+----------+-----------+
>>> | state | time_end | time_start | time_submit | id_assoc | partition |
>>> +-------+----------+------------+-------------+----------+-----------+
>>> |     0 |        0 |          0 |  1546880711 |     2078 | gpu-long  |
>>> +-------+----------+------------+-------------+----------+-----------+
>>> 1 row in set (0.00 sec)
>>>
>>> mysql> update banana_job_table set state=3 where id_job=899139;
>>> Query OK, 1 row affected (0.00 sec)
>>> Rows matched: 1  Changed: 1  Warnings: 0
>>>
>>> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
>>> +-------+----------+------------+-------------+----------+-----------+
>>> | state | time_end | time_start | time_submit | id_assoc | partition |
>>> +-------+----------+------------+-------------+----------+-----------+
>>> |     3 |        0 |          0 |  1546880711 |     2078 | gpu-long  |
>>> +-------+----------+------------+-------------+----------+-----------+
>>> 1 row in set (0.00 sec)
>>>
>>> mysql> update banana_job_table set time_start=1546880712 where id_job=899139;
>>> Query OK, 1 row affected (0.00 sec)
>>> Rows matched: 1  Changed: 1  Warnings: 0
>>>
>>> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
>>> +-------+----------+------------+-------------+----------+-----------+
>>> | state | time_end | time_start | time_submit | id_assoc | partition |
>>> +-------+----------+------------+-------------+----------+-----------+
>>> |     3 |        0 | 1546880712 |  1546880711 |     2078 | gpu-long  |
>>> +-------+----------+------------+-------------+----------+-----------+
>>> 1 row in set (0.00 sec)
>>>
>>> mysql> update banana_job_table set time_end=1546880713 where id_job=899139;
>>> Query OK, 1 row affected (0.01 sec)
>>> Rows matched: 1  Changed: 1  Warnings: 0
>>>
>>> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
>>> +-------+------------+------------+-------------+----------+-----------+
>>> | state | time_end   | time_start | time_submit | id_assoc | partition |
>>> +-------+------------+------------+-------------+----------+-----------+
>>> |     3 | 1546880713 | 1546880712 |  1546880711 |     2078 | gpu-long  |
>>> +-------+------------+------------+-------------+----------+-----------+
>>> 1 row in set (0.00 sec)
>>>
>>> In this case for job ID 899139 on the banana cluster, the state was 
>>> not updated and neither were start or end times. I went in and 
>>> manually edited the job entries such that Slurm thought they were 
>>> complete with feasible start and end times. Again, this worked for 
>>> me. I don't know if this is your problem or not. If you choose this 
>>> route, be careful and good luck!
>>>
>>> On 3/6/19 10:15 AM, Brian Andrus wrote:
>>>>
>>>> It shows several jobs that all have "Unknown" for end_time. Some 
>>>> are PENDING and some are RUNNING (none are truly in either state).
>>>>
>>>> It asked to fix them, which I did, but nothing seems to have 
>>>> changed. They still show up with that command and in reports.
>>>>
>>>>
>>>> Brian
>>>>
>>>> On 3/5/2019 10:34 PM, Chris Samuel wrote:
>>>>> On Tuesday, 5 March 2019 10:07:30 AM PST Brian Andrus wrote:
>>>>>
>>>>>> Does anyone have a process they use to handle empty (aka 
>>>>>> "Unknown") end times for jobs that are not running?
>>>>> What does:
>>>>>
>>>>> sacctmgr list runawayjobs
>>>>>
>>>>> say?
>>>>>
>>>>
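
The manual edit quoted above (three separate UPDATE statements for job 899139 on the banana cluster) could probably be condensed into one guarded statement. The sketch below only prints that statement; it does not touch the database. It assumes the same table layout as in the quoted session, covers only the case shown there (a job recorded as pending that never started, so time_start and time_end are both 0), and reuses the same values: state 3, start one second after submit, end two seconds after submit. The function name fix_job_sql is made up for illustration.

```shell
#!/bin/sh
# Sketch only: print a single guarded UPDATE for one stuck job.
# Nothing here talks to MySQL; review the output before pasting it.

# fix_job_sql CLUSTER JOBID
fix_job_sql() {
    cluster=$1
    jobid=$2
    # Refuse anything that is not a plain number or a simple cluster name,
    # since both values are spliced into the SQL text.
    case $jobid in
        ''|*[!0-9]*) echo "bad job id: $jobid" >&2; return 1 ;;
    esac
    case $cluster in
        ''|*[!A-Za-z0-9_]*) echo "bad cluster name: $cluster" >&2; return 1 ;;
    esac
    # state=3 is the value used in the quoted session. The time_start=0 and
    # time_end=0 guards leave rows with real timestamps untouched.
    printf 'UPDATE %s_job_table SET state=3, time_start=time_submit+1, time_end=time_submit+2 WHERE id_job=%s AND time_start=0 AND time_end=0;\n' "$cluster" "$jobid"
}
```

For example, `fix_job_sql banana 899139` prints a statement that, for the row shown above with time_submit 1546880711, would yield time_start 1546880712 and time_end 1546880713, the same values entered by hand in the quoted session. Take a mysqldump backup with slurmdbd stopped before running anything like this.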