[slurm-users] sacct end time for failed jobs

Brian Andrus toomuchit at gmail.com
Wed Mar 6 18:23:25 UTC 2019


I am running the latest and did that, but it didn't change anything. The 
jobs stay in the runaway state and no changes are made to the database.

Using 18.08.2-1.

Maybe try updating to 19.05.0-0pre1?

Brian


On 3/6/2019 10:06 AM, Paul Edmon wrote:
>
> A lot of this is automated in the newer versions of Slurm.  You should
> just need to run:
>
> sacctmgr show runawayjobs
>
> It will then give you the option to clean them, and Slurm will handle
> the rest.  If you add the -i option it will clean them automatically
> without prompting.
>
> -Paul Edmon-
>
> On 3/6/2019 11:58 AM, Cyrus Proctor wrote:
>>
>> Hi Brian,
>>
>> Others probably have better suggestions before going the route I'm 
>> about to detail. If you do go this route, be warned, you definitely 
>> have the ability to irrevocably lose data or destroy your Slurm 
>> accounting database. Do so at your own risk. I got here with
>> Google-fu after running out of other (known to me) options. Someone
>> please save Brian from having to do what comes below ;-)
>>
>> Last warning: I'd recommend turning off slurmdbd and backing up the 
>> database (mysqldump) before going forward.
>>
>> In my case, runaway jobs did not show up with `sacctmgr list
>> runawayjobs`. My problem was that I could not remove a user from the
>> Slurm database because it thought they still had active jobs. The
>> likely cause was the slurmdbd daemon not shutting down gracefully at
>> some point. The job was long gone but was still recorded as pending:
>>
>> # sacct -j 899139
>>         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
>> ------------ ---------- ---------- ---------- ---------- ---------- --------
>> 899139            equil   gpu-long    p-1234         20    PENDING      0:0
>> # scontrol show job 899139
>> slurm_load_jobs error: Invalid job id specified
>> # mysql -u root -p
>> ...
>> Welcome to the MySQL monitor.  Commands end with ; or \g.
>> Your MySQL connection id is 7453
>> Server version: 5.1.73 Source distribution
>>
>> Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
>>
>> Oracle is a registered trademark of Oracle Corporation and/or its
>> affiliates. Other names may be trademarks of their respective
>> owners.
>>
>> Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
>>
>> mysql> use slurm_acct_db;
>> Reading table information for completion of table and column names
>> You can turn off this feature to get a quicker startup with -A
>>
>> Database changed
>> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
>> +-------+----------+------------+-------------+----------+-----------+
>> | state | time_end | time_start | time_submit | id_assoc | partition |
>> +-------+----------+------------+-------------+----------+-----------+
>> |     0 |        0 |          0 |  1546880711 |     2078 | gpu-long  |
>> +-------+----------+------------+-------------+----------+-----------+
>> 1 row in set (0.00 sec)
>>
>> mysql> update banana_job_table set state=3 where id_job=899139;
>> Query OK, 1 row affected (0.00 sec)
>> Rows matched: 1  Changed: 1  Warnings: 0
>>
>> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
>> +-------+----------+------------+-------------+----------+-----------+
>> | state | time_end | time_start | time_submit | id_assoc | partition |
>> +-------+----------+------------+-------------+----------+-----------+
>> |     3 |        0 |          0 |  1546880711 |     2078 | gpu-long  |
>> +-------+----------+------------+-------------+----------+-----------+
>> 1 row in set (0.00 sec)
>>
>> mysql> update banana_job_table set time_start=1546880712 where id_job=899139;
>> Query OK, 1 row affected (0.00 sec)
>> Rows matched: 1  Changed: 1  Warnings: 0
>>
>> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
>> +-------+----------+------------+-------------+----------+-----------+
>> | state | time_end | time_start | time_submit | id_assoc | partition |
>> +-------+----------+------------+-------------+----------+-----------+
>> |     3 |        0 | 1546880712 |  1546880711 |     2078 | gpu-long  |
>> +-------+----------+------------+-------------+----------+-----------+
>> 1 row in set (0.00 sec)
>>
>> mysql> update banana_job_table set time_end=1546880713 where id_job=899139;
>> Query OK, 1 row affected (0.01 sec)
>> Rows matched: 1  Changed: 1  Warnings: 0
>>
>> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
>> +-------+------------+------------+-------------+----------+-----------+
>> | state | time_end   | time_start | time_submit | id_assoc | partition |
>> +-------+------------+------------+-------------+----------+-----------+
>> |     3 | 1546880713 | 1546880712 |  1546880711 |     2078 | gpu-long  |
>> +-------+------------+------------+-------------+----------+-----------+
>> 1 row in set (0.00 sec)
>>
>> In this case for job ID 899139 on the banana cluster, the state was 
>> not updated and neither were start or end times. I went in and 
>> manually edited the job entries such that Slurm thought they were 
>> complete with feasible start and end times. Again, this worked for 
>> me. I don't know if this is your problem or not. If you choose this 
>> route, be careful and good luck!
>>
>> On 3/6/19 10:15 AM, Brian Andrus wrote:
>>>
>>> It shows several jobs that all have "Unknown" for end_time. Some are 
>>> PENDING and some are RUNNING (none are truly in either state).
>>>
>>> It asked to fix them, which I did, but nothing seems to have 
>>> changed. They still show up with that command and in reports.
>>>
>>>
>>> Brian
>>>
>>> On 3/5/2019 10:34 PM, Chris Samuel wrote:
>>>> On Tuesday, 5 March 2019 10:07:30 AM PST Brian Andrus wrote:
>>>>
>>>>> Does anyone have a process they use to handle empty (aka
>>>>> "Unknown") end times for jobs that are not running?
>>>> What does:
>>>>
>>>> sacctmgr list runawayjobs
>>>>
>>>> say?
>>>>
>>>
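
The three manual updates in Cyrus's session (state, time_start, time_end)
can be collapsed into one statement. The sketch below is only an
illustration under assumptions: it prints the SQL for review and never
touches the database. It assumes the job table is named
<cluster>_job_table, that state 3 is the value that marks the job complete
(as in the session), and that start and end are placed 1 and 2 seconds
after time_submit, as in the example. The helper name runaway_fix_sql and
the extra "time_end = 0" guard are hypothetical additions, not something
from the thread.

```shell
#!/bin/sh
# Sketch only: prints an UPDATE statement for review, runs nothing.
# Assumptions: table is <cluster>_job_table; state=3 marks the job
# complete; start/end are time_submit + 1 and + 2, as in the example.

# runaway_fix_sql CLUSTER JOB_ID  (hypothetical helper name)
runaway_fix_sql() {
    rf_cluster=$1
    rf_job_id=$2

    # Both values are interpolated into SQL, so accept only a numeric
    # job ID and a cluster name made of letters, digits and underscores.
    case "$rf_job_id" in
        ''|*[!0-9]*) echo "bad job id: $rf_job_id" >&2; return 1 ;;
    esac
    case "$rf_cluster" in
        ''|*[!A-Za-z0-9_]*) echo "bad cluster: $rf_cluster" >&2; return 1 ;;
    esac

    # The "time_end = 0" guard (an added assumption) limits the change
    # to rows that have no end time recorded.
    cat <<EOF
UPDATE ${rf_cluster}_job_table
   SET state = 3,
       time_start = time_submit + 1,
       time_end = time_submit + 2
 WHERE id_job = ${rf_job_id}
   AND time_end = 0;
EOF
}

# Values taken from the example session.
runaway_fix_sql banana 899139
```

If the printed statement looks right, it could be pasted into the mysql
prompt, but only after stopping slurmdbd and taking a mysqldump backup as
Cyrus recommends. The same risk warning applies.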