[slurm-users] sacct end time for failed jobs

Paul Edmon pedmon at cfa.harvard.edu
Wed Mar 6 18:06:07 UTC 2019


Much of this is automated in newer versions of Slurm.  You should 
just need to run:

sacctmgr show runawayjobs

It will then offer to clean them, and Slurm will handle the rest.  If 
you add the -i option, it will clean them automatically without prompting.
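
For reference, the two invocations could be wrapped roughly as below. 
This is only a sketch: the function name runaway_jobs and its --auto 
argument are made up for illustration. The only real pieces are 
"sacctmgr show runawayjobs" and the -i option mentioned above.

```shell
# Hypothetical helper: list runaway jobs, optionally cleaning them
# without the interactive prompt.
runaway_jobs() {
    if [ "${1:-}" = "--auto" ]; then
        # -i (immediate) commits the fix without asking for confirmation.
        sacctmgr -i show runawayjobs
    else
        # Lists runaway jobs, then asks whether to fix them.
        sacctmgr show runawayjobs
    fi
}
```

Running runaway_jobs with no argument first, and reviewing the list 
before using --auto, is probably the safer order.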

-Paul Edmon-

On 3/6/2019 11:58 AM, Cyrus Proctor wrote:
>
> Hi Brian,
>
> Others probably have better suggestions than the route I'm about to 
> detail. If you do go this route, be warned: you can irrevocably lose 
> data or destroy your Slurm accounting database. Proceed at your own 
> risk. I got here with Google-fu after running out of other (known to 
> me) options. Someone please save Brian from having to do what comes 
> below ;-)
>
> Last warning: I'd recommend turning off slurmdbd and backing up the 
> database (mysqldump) before going forward.
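
A minimal sketch of that backup step, under some assumptions that may 
not match your site: slurmdbd is managed by systemd, the MySQL root 
account is used, and the database is named slurm_acct_db (the name that 
appears in the session quoted further down). The function name and the 
default output file name are invented for illustration.

```shell
# Hypothetical helper: stop slurmdbd, then dump the accounting database.
# Assumes systemd manages slurmdbd; adjust for your init system.
backup_slurm_acct_db() {
    db="${1:-slurm_acct_db}"
    out="${2:-${db}-$(date +%Y%m%d-%H%M%S).sql}"
    systemctl stop slurmdbd || return 1
    # --single-transaction takes a consistent snapshot of InnoDB tables.
    # -p makes mysqldump prompt for the password.
    mysqldump --single-transaction -u root -p "$db" > "$out" || return 1
    # Print the path of the dump so the caller can verify it.
    echo "$out"
}
```

Check that the dump file is non-empty before touching any tables, and 
start slurmdbd again only once the manual edits are finished.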
>
> In my case, runaway jobs did not show up with
> `sacctmgr list runawayjobs`. My problem was that I could not remove a 
> user from the Slurm database because it thought they still had active 
> jobs. The likely cause was the slurmdbd daemon not shutting down 
> gracefully at some point. The job was long gone, but the database 
> still showed it in a pending state:
>
> # sacct -j 899139
>         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 899139            equil   gpu-long    p-1234         20    PENDING      0:0
> # scontrol show job 899139
> slurm_load_jobs error: Invalid job id specified
> # mysql -u root -p
> ...
> Welcome to the MySQL monitor.  Commands end with ; or \g.
> Your MySQL connection id is 7453
> Server version: 5.1.73 Source distribution
>
> Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
>
> Oracle is a registered trademark of Oracle Corporation and/or its
> affiliates. Other names may be trademarks of their respective
> owners.
>
> Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
>
> mysql> use slurm_acct_db;
> Reading table information for completion of table and column names
> You can turn off this feature to get a quicker startup with -A
>
> Database changed
> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
> +-------+----------+------------+-------------+----------+-----------+
> | state | time_end | time_start | time_submit | id_assoc | partition |
> +-------+----------+------------+-------------+----------+-----------+
> |     0 |        0 |          0 |  1546880711 |     2078 | gpu-long  |
> +-------+----------+------------+-------------+----------+-----------+
> 1 row in set (0.00 sec)
>
> mysql> update banana_job_table set state=3 where id_job=899139;
> Query OK, 1 row affected (0.00 sec)
> Rows matched: 1  Changed: 1  Warnings: 0
>
> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
> +-------+----------+------------+-------------+----------+-----------+
> | state | time_end | time_start | time_submit | id_assoc | partition |
> +-------+----------+------------+-------------+----------+-----------+
> |     3 |        0 |          0 |  1546880711 |     2078 | gpu-long  |
> +-------+----------+------------+-------------+----------+-----------+
> 1 row in set (0.00 sec)
>
> mysql> update banana_job_table set time_start=1546880712 where id_job=899139;
> Query OK, 1 row affected (0.00 sec)
> Rows matched: 1  Changed: 1  Warnings: 0
>
> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
> +-------+----------+------------+-------------+----------+-----------+
> | state | time_end | time_start | time_submit | id_assoc | partition |
> +-------+----------+------------+-------------+----------+-----------+
> |     3 |        0 | 1546880712 |  1546880711 |     2078 | gpu-long  |
> +-------+----------+------------+-------------+----------+-----------+
> 1 row in set (0.00 sec)
>
> mysql> update banana_job_table set time_end=1546880713 where id_job=899139;
> Query OK, 1 row affected (0.01 sec)
> Rows matched: 1  Changed: 1  Warnings: 0
>
> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
> +-------+------------+------------+-------------+----------+-----------+
> | state | time_end   | time_start | time_submit | id_assoc | partition |
> +-------+------------+------------+-------------+----------+-----------+
> |     3 | 1546880713 | 1546880712 |  1546880711 |     2078 | gpu-long  |
> +-------+------------+------------+-------------+----------+-----------+
> 1 row in set (0.00 sec)
>
> In this case, for job ID 899139 on the banana cluster, neither the 
> state nor the start and end times had been updated. I manually edited 
> the job entry so that Slurm considered it complete (state 3), with 
> feasible start and end times. Again, this worked for me. I don't know 
> whether this is your problem or not. If you choose this route, be 
> careful and good luck!
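
The three separate updates in the session above could probably be 
collapsed into a single statement. The sketch below only prints that 
statement so it can be reviewed before being pasted into mysql; it does 
not touch the database. The function name complete_job_sql is 
hypothetical, and the <cluster>_job_table naming is inferred from the 
banana_job_table used above.

```shell
# Hypothetical helper: print (not run) one UPDATE that marks a job complete.
# Arguments: cluster name, job id, start time, end time (epoch seconds).
complete_job_sql() {
    cluster="$1"; job_id="$2"; t_start="$3"; t_end="$4"
    # Refuse an end time earlier than the start time.
    [ "$t_end" -ge "$t_start" ] || { echo "end before start" >&2; return 1; }
    # state=3 is the value used above to make Slurm treat the job as complete.
    printf 'UPDATE %s_job_table SET state=3, time_start=%d, time_end=%d WHERE id_job=%d;\n' \
        "$cluster" "$t_start" "$t_end" "$job_id"
}

# Same values as in the session above.
complete_job_sql banana 899139 1546880712 1546880713
```

Printing instead of executing keeps the backup-first, review-then-apply 
order that the warnings above call for.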
>
> On 3/6/19 10:15 AM, Brian Andrus wrote:
>>
>> It shows several jobs that all have "Unknown" for end_time. Some are 
>> PENDING and some are RUNNING (none are truly in either state).
>>
>> It asked to fix them, which I did, but nothing seems to have changed. 
>> They still show up with that command and in reports.
>>
>>
>> Brian
>>
>> On 3/5/2019 10:34 PM, Chris Samuel wrote:
>>> On Tuesday, 5 March 2019 10:07:30 AM PST Brian Andrus wrote:
>>>
>>>> Does anyone have a process they use to handle empty (aka "Unknown")
>>>> end times for jobs that are not running?
>>> What does:
>>>
>>> sacctmgr list runawayjobs
>>>
>>> say?
>>>
>>