[slurm-users] Job cancelled into the future

Tue Dec 20 23:01:41 UTC 2022

Seems like the time may have been off on the db server at the insert/update.

You may want to dump the database, find which tables/records need to be
updated, and try updating them. If anything goes south, you can restore
from the dump.
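
To make the dump / find / update / restore sequence concrete, here is a minimal sketch against a toy in-memory SQLite table. It is not the real accounting database: the table name job_table and the columns id_job, time_start and time_end are assumptions for illustration, so check the actual schema in your slurmdbd database before touching anything. Job 290000 is made up, and the replacement end time is simply the value tried with scontrol in the quoted message, not a known-correct one.

```python
import sqlite3
from datetime import datetime, timezone


def epoch(y, mo, d, h, mi, s):
    """UTC calendar time -> integer Unix timestamp."""
    return int(datetime(y, mo, d, h, mi, s, tzinfo=timezone.utc).timestamp())


def find_future_end(conn, now):
    """Return rows whose recorded end time is later than `now`."""
    return conn.execute(
        "SELECT id_job, time_start, time_end FROM job_table "
        "WHERE time_end > ? ORDER BY id_job",
        (now,),
    ).fetchall()


def set_end_time(conn, id_job, new_end):
    """Overwrite time_end for one job; returns the number of rows changed."""
    cur = conn.execute(
        "UPDATE job_table SET time_end = ? WHERE id_job = ?", (new_end, id_job)
    )
    conn.commit()
    return cur.rowcount


# Toy data: one ordinary job (hypothetical ID 290000) and one stuck job.
# Times are the EDT values from the thread converted to UTC.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_table (id_job INTEGER PRIMARY KEY, "
    "time_start INTEGER, time_end INTEGER)"
)
conn.executemany(
    "INSERT INTO job_table VALUES (?, ?, ?)",
    [
        (290000, epoch(2022, 8, 1, 12, 0, 0), epoch(2022, 8, 1, 13, 0, 0)),
        (290742, epoch(2022, 8, 8, 12, 46, 53), epoch(2023, 8, 8, 12, 47, 1)),
    ],
)
conn.commit()

# 1. Dump first, so there is something to restore from.
dump_sql = "\n".join(conn.iterdump())

# 2. Find the records that need updating: end time still in the future.
now = epoch(2022, 12, 20, 23, 0, 0)
stuck = find_future_end(conn, now)

# 3. Update them (placeholder end time, see the note above).
new_end = epoch(2022, 8, 9, 12, 47, 1)
changed = sum(set_end_time(conn, job_id, new_end) for job_id, _, _ in stuck)
remaining = find_future_end(conn, now)

# 4. If anything went south, load the dump into a fresh database.
restored = sqlite3.connect(":memory:")
restored.executescript(dump_sql)
restored_stuck = find_future_end(restored, now)
```

On the real system the same steps would presumably be a database dump taken with slurmdbd stopped, a SELECT to find the rows, and an UPDATE, but the sketch only shows the shape of the procedure.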

Brian Andrus

On 12/20/2022 11:51 AM, Reed Dier wrote:
> Just to follow up with some things I’ve tried:
>
> scancel doesn’t want to touch it:
>> # scancel -v 290710
>> scancel: Terminating job 290710
>> scancel: error: Kill job error on job id 290710: Job/step already 
>> completing or completed
>
> scontrol does see that these are all members of the same array, but
> doesn’t want to touch it either:
>> # scontrol update JobID=290710 EndTime=2022-08-09T08:47:01
>> 290710_4,6,26,32,60,67,83,87,89,91,...: Job has already finished
>
> And trying to modify the job’s end time with sacctmgr fails, as
> expected, because EndTime is only a where spec, not a set spec. I
> also tried EndTime=now, with the same results:
>> # sacctmgr modify job where JobID=290710 set EndTime=2022-08-09T08:47:01
>>  Unknown option: EndTime=2022-08-09T08:47:01
>>  Use keyword 'where' to modify condition
>>  You didn't give me anything to set
>
> I was able to set a comment for the jobs/array, so the DBD can 
> see/talk to them.
> One additional thing to mention is that there are 14 JIDs that are 
> stuck like this, 1 is an Array JID, and 13 of them are array tasks on 
> the original Array ID.
>
> But I figured I would list the other steps I’ve tried, to rule
> those ideas out.
>
> Thanks,
> Reed
>
>> On Dec 20, 2022, at 10:08 AM, Reed Dier <reed.dier at focusvq.com> wrote:
>>
>> 2 votes for runawayjobs is a strong vote (and also something I’m glad 
>> to learn exists for the future), however, it does not appear to be 
>> the case.
>>
>>> # sacctmgr show runawayjobs
>>> Runaway Jobs: No runaway jobs found on cluster $cluster
>>
>> So unfortunately that doesn’t appear to be the culprit.
>>
>> Appreciate the responses.
>>
>> Reed
>>
>>> On Dec 20, 2022, at 10:03 AM, Brian Andrus <toomuchit at gmail.com> wrote:
>>>
>>> Try:
>>>
>>>     sacctmgr list runawayjobs
>>>
>>> Brian Andrus
>>>
>>> On 12/20/2022 7:54 AM, Reed Dier wrote:
>>>> Hoping this is a fairly simple one.
>>>>
>>>> This is a small internal cluster that we’ve been using for about 6 
>>>> months now, and we’ve had some infrastructure instability in that 
>>>> time, which I think may be the root culprit behind this weirdness, 
>>>> but hopefully someone can point me in the direction to solve the issue.
>>>>
>>>> I do a daily email of sreport to show how busy the cluster was, and 
>>>> who were the top users.
>>>> Weirdly, I have a user who seems to rack up the exact same
>>>> usage day after day after day, down to a hundredth of a percent,
>>>> conspicuously even when they were on vacation and claimed that they
>>>> didn’t have job submissions in cron/etc.
>>>>
>>>> So then, taking the scom TUI
>>>> <https://lists.schedmd.com/pipermail/slurm-users/2022-December/009514.html>
>>>> (posted this morning) for a spin, I filtered on that user and noticed
>>>> that even though I was only looking 2 days back at job history, I was
>>>> seeing a job from August.
>>>>
>>>> Conspicuously, the job state is cancelled, but the job end time is 
>>>> 1y from the start time, meaning its job end time is in 2023.
>>>> So something with the dbd is confused about this/these jobs that 
>>>> are lingering and reporting cancelled but still “on the books” 
>>>> somehow until next August.
>>>>
>>>>> Job ID         : 290742
>>>>> Job Name       : $jobname
>>>>> User           : $user
>>>>> Group          : $user
>>>>> Job Account    : $account
>>>>> Job Submission : 2022-08-08 08:44:52 -0400 EDT
>>>>> Job Start      : 2022-08-08 08:46:53 -0400 EDT
>>>>> Job End        : 2023-08-08 08:47:01 -0400 EDT
>>>>> Job Wait time  : 2m1s
>>>>> Job Run time   : 8760h0m8s
>>>>> Partition      : $part
>>>>> Priority       : 127282
>>>>> QoS            : $qos
>>>>> Steps count: 0
>>>>
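
As a quick sanity check on the fields above: the displayed run time is exactly what you get by subtracting Job Start from Job End, i.e. 365 days plus 8 seconds, which is consistent with the end time being stored one year too late rather than being random garbage. A small sketch of that arithmetic follows; the hms helper is only for illustration, and the timestamps are treated as naive local times since all three carry the same -0400 offset.

```python
from datetime import datetime, timedelta

# Values copied from the job details quoted above (all -0400 EDT).
submit = datetime(2022, 8, 8, 8, 44, 52)
start = datetime(2022, 8, 8, 8, 46, 53)
end = datetime(2023, 8, 8, 8, 47, 1)


def hms(delta):
    """Format a timedelta as hours/minutes/seconds, e.g. 8760h0m8s."""
    total = int(delta.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}h{minutes}m{seconds}s"


run_time = end - start
wait_time = start - submit

print(hms(run_time))   # 8760h0m8s
print(hms(wait_time))  # 0h2m1s
print(run_time == timedelta(days=365, seconds=8))  # True
```
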
>>>>> Filter: $user       Items: 13
>>>>>
>>>>>  Job ID  Job Name  Part.  QoS   Account  User   Nodes   State
>>>>>  ──────  ────────  ─────  ────  ───────  ─────  ──────  ─────────
>>>>>  290714  $jobname  $part  $qos  $acct    $user  node32  CANCELLED
>>>>>  290716  $jobname  $part  $qos  $acct    $user  node24  CANCELLED
>>>>>  290736  $jobname  $part  $qos  $acct    $user  node00  CANCELLED
>>>>>  290742  $jobname  $part  $qos  $acct    $user  node01  CANCELLED
>>>>>  290770  $jobname  $part  $qos  $acct    $user  node02  CANCELLED
>>>>>  290777  $jobname  $part  $qos  $acct    $user  node03  CANCELLED
>>>>>  290793  $jobname  $part  $qos  $acct    $user  node04  CANCELLED
>>>>>  290797  $jobname  $part  $qos  $acct    $user  node05  CANCELLED
>>>>>  290799  $jobname  $part  $qos  $acct    $user  node06  CANCELLED
>>>>>  290801  $jobname  $part  $qos  $acct    $user  node07  CANCELLED
>>>>>  290814  $jobname  $part  $qos  $acct    $user  node08  CANCELLED
>>>>>  290817  $jobname  $part  $qos  $acct    $user  node09  CANCELLED
>>>>>  290819  $jobname  $part  $qos  $acct    $user  node10  CANCELLED
>>>>
>>>> I’d love to figure out the proper way to either purge these JIDs
>>>> from the accounting database cleanly, or change the job end/run
>>>> time to a sane/correct value.
>>>> Slurm is v21.08.8-2, and ntp is a stratum 1 server, so time is in 
>>>> sync everywhere, not that multiple servers would drift 1 year off 
>>>> like this.
>>>>
>>>> Thanks for any help,
>>>> Reed
>>
>