[slurm-users] sacct end time for failed jobs
Brian Andrus
toomuchit at gmail.com
Wed Mar 6 18:23:25 UTC 2019
I am running the latest and did that, but it didn't change anything. The
jobs stay in the runaway state and no changes are made to the database.
Using 18.08.2-1.
Maybe try updating to 19.05.0-0pre1?
Brian
On 3/6/2019 10:06 AM, Paul Edmon wrote:
>
> A lot of this is automated in the new versions of slurm. You should
> just need to run:
>
> sacctmgr show runawayjobs
>
> It will then give you an option to clean them and slurm will handle
> the rest. If you add the -i option it will just clean them automatically.
>
> -Paul Edmon-
>
> On 3/6/2019 11:58 AM, Cyrus Proctor wrote:
>>
>> Hi Brian,
>>
>> Others probably have better suggestions before going the route I'm
>> about to detail. If you do go this route, be warned, you definitely
>> have the ability to irrevocably lose data or destroy your Slurm
>> accounting database. Do so at your own risk. I got here with
>> Google-fu after being out of other (known to me) options. Someone
>> please save Brian having to do what comes below ;-)
>>
>> Last warning: I'd recommend turning off slurmdbd and backing up the
>> database (mysqldump) before going forward.
>>
>> In my case, runaway jobs did not show up with `sacctmgr list
>> runawayjobs`. My problem was that I could not remove a user from the
>> Slurm database because it thought they still had active jobs. The
>> likely cause was the slurmdbd daemon not shutting down gracefully at
>> some point. The job was long gone but was still in a pending state:
>>
>> # sacct -j 899139
>>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
>> ------------ ---------- ---------- ---------- ---------- ---------- --------
>> 899139            equil   gpu-long     p-1234         20    PENDING      0:0
>> # scontrol show job 899139
>> slurm_load_jobs error: Invalid job id specified
>> # mysql -u root -p
>> ...
>> Welcome to the MySQL monitor. Commands end with ; or \g.
>> Your MySQL connection id is 7453
>> Server version: 5.1.73 Source distribution
>>
>> Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
>>
>> Oracle is a registered trademark of Oracle Corporation and/or its
>> affiliates. Other names may be trademarks of their respective
>> owners.
>>
>> Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
>>
>> mysql> use slurm_acct_db;
>> Reading table information for completion of table and column names
>> You can turn off this feature to get a quicker startup with -A
>>
>> Database changed
>> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
>> +-------+----------+------------+-------------+----------+-----------+
>> | state | time_end | time_start | time_submit | id_assoc | partition |
>> +-------+----------+------------+-------------+----------+-----------+
>> |     0 |        0 |          0 |  1546880711 |     2078 | gpu-long  |
>> +-------+----------+------------+-------------+----------+-----------+
>> 1 row in set (0.00 sec)
>>
>> mysql> update banana_job_table set state=3 where id_job=899139;
>> Query OK, 1 row affected (0.00 sec)
>> Rows matched: 1 Changed: 1 Warnings: 0
>>
>> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
>> +-------+----------+------------+-------------+----------+-----------+
>> | state | time_end | time_start | time_submit | id_assoc | partition |
>> +-------+----------+------------+-------------+----------+-----------+
>> |     3 |        0 |          0 |  1546880711 |     2078 | gpu-long  |
>> +-------+----------+------------+-------------+----------+-----------+
>> 1 row in set (0.00 sec)
>>
>> mysql> update banana_job_table set time_start=1546880712 where id_job=899139;
>> Query OK, 1 row affected (0.00 sec)
>> Rows matched: 1 Changed: 1 Warnings: 0
>>
>> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
>> +-------+----------+------------+-------------+----------+-----------+
>> | state | time_end | time_start | time_submit | id_assoc | partition |
>> +-------+----------+------------+-------------+----------+-----------+
>> |     3 |        0 | 1546880712 |  1546880711 |     2078 | gpu-long  |
>> +-------+----------+------------+-------------+----------+-----------+
>> 1 row in set (0.00 sec)
>>
>> mysql> update banana_job_table set time_end=1546880713 where id_job=899139;
>> Query OK, 1 row affected (0.01 sec)
>> Rows matched: 1 Changed: 1 Warnings: 0
>>
>> mysql> select state,time_end,time_start,time_submit,id_assoc,partition from banana_job_table where id_job=899139;
>> +-------+------------+------------+-------------+----------+-----------+
>> | state | time_end   | time_start | time_submit | id_assoc | partition |
>> +-------+------------+------------+-------------+----------+-----------+
>> |     3 | 1546880713 | 1546880712 |  1546880711 |     2078 | gpu-long  |
>> +-------+------------+------------+-------------+----------+-----------+
>> 1 row in set (0.00 sec)
>>
>> In this case for job ID 899139 on the banana cluster, the state was
>> not updated and neither were start or end times. I went in and
>> manually edited the job entries such that Slurm thought they were
>> complete with feasible start and end times. Again, this worked for
>> me. I don't know if this is your problem or not. If you choose this
>> route, be careful and good luck!
>>
>> On 3/6/19 10:15 AM, Brian Andrus wrote:
>>>
>>> It shows several jobs that all have "Unknown" for end_time. Some are
>>> PENDING and some are RUNNING (none are truly in either state).
>>>
>>> It asked to fix them, which I did, but nothing seems to have
>>> changed. They still show up with that command and in reports.
>>>
>>>
>>> Brian
>>>
>>> On 3/5/2019 10:34 PM, Chris Samuel wrote:
>>>> On Tuesday, 5 March 2019 10:07:30 AM PST Brian Andrus wrote:
>>>>
>>>>> Does anyone have a process they use to handle empty (aka "Unknown")
>>>>> end times for jobs that are not running?
>>>> What does:
>>>>
>>>> sacctmgr list runawayjobs
>>>>
>>>> say?
>>>>
>>>
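
For anyone reading this later: the three separate UPDATE statements in Cyrus's session can be combined into one guarded statement inside a transaction, so a row is either fully fixed or left untouched. What follows is only a sketch of that idea, not a tested procedure for a real Slurm accounting database. It uses Python's built-in sqlite3 with an in-memory toy table as a stand-in for MySQL. The table name and column names are copied from the session above, the helper name close_stale_job is made up for illustration, and the extra WHERE conditions (time_end = 0, time_submit <= time_start) are my own assumption about what a reasonable safety check looks like. State 3 is the value used in the thread; in Slurm's job state numbering it corresponds to a completed job.

```python
import sqlite3

JOB_COMPLETE = 3  # state value used in the thread's manual UPDATE


def close_stale_job(conn, job_id, time_start, time_end):
    """Mark one stale job as completed, in a single transaction.

    Returns the number of rows changed: 0 if the job is missing,
    already has an end time, or was submitted after time_start.
    """
    if not (0 < time_start <= time_end):
        raise ValueError("need 0 < time_start <= time_end")
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE banana_job_table "
            "SET state = ?, time_start = ?, time_end = ? "
            "WHERE id_job = ? AND time_end = 0 AND time_submit <= ?",
            (JOB_COMPLETE, time_start, time_end, job_id, time_start),
        )
    return cur.rowcount


# Toy stand-in for the real table: only the columns touched in the thread.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE banana_job_table ("
    "id_job INTEGER PRIMARY KEY, state INTEGER, "
    "time_end INTEGER, time_start INTEGER, time_submit INTEGER)"
)
conn.execute(
    "INSERT INTO banana_job_table VALUES (899139, 0, 0, 0, 1546880711)"
)

changed = close_stale_job(conn, 899139, 1546880712, 1546880713)
row = conn.execute(
    "SELECT state, time_end, time_start, time_submit "
    "FROM banana_job_table WHERE id_job = 899139"
).fetchone()
print(changed, row)
```

Because of the time_end = 0 guard, running the helper a second time on the same job changes nothing, which makes it harder to overwrite a record that already has a real end time. The earlier warnings still apply to the real thing: stop slurmdbd and take a mysqldump backup first.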