[slurm-users] Job cancelled into the future

Brian Andrus toomuchit at gmail.com
Tue Dec 20 16:03:44 UTC 2022


Try:

     sacctmgr list runawayjobs
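
That lists "runaway" jobs: jobs that no longer exist in the controller
but that slurmdbd still considers active, which is what keeps them "on
the books" in sreport. If it finds any, sacctmgr will prompt to fix
them by closing out their end times.

If that doesn't catch these particular jobs, the end time can in
principle be corrected directly in the accounting database. A rough
sketch, not a supported path -- it assumes a MySQL/MariaDB backend
with the default slurm_acct_db schema; back up the database first,
ideally with slurmdbd stopped:

     # 'mycluster' is a placeholder for your cluster's name;
     # 290742 is one of the stuck job IDs from the listing below
     mysql slurm_acct_db -e \
         "UPDATE mycluster_job_table SET time_end = time_start WHERE id_job = 290742;"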

Brian Andrus

On 12/20/2022 7:54 AM, Reed Dier wrote:
> Hoping this is a fairly simple one.
>
> This is a small internal cluster that we’ve been using for about 6 
> months now. We’ve had some infrastructure instability in that time, 
> which I suspect is the root cause of this weirdness, but hopefully 
> someone can point me in the right direction to solve the issue.
>
> I send a daily email of sreport output to show how busy the cluster 
> was and who the top users were.
> Weirdly, one user seems to accrue exactly the same usage day after 
> day after day, down to the hundredth of a percent, conspicuously 
> even while they were on vacation and claimed that they didn’t have 
> job submissions in cron/etc.
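>
> (For reference, the daily report is pulled with something along these 
> lines; the exact report and flags here are illustrative:)
>
>      sreport user topusage start=$(date -d yesterday +%F) \
>          end=$(date +%F) -t percent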
>
> So, taking the scom TUI 
> <https://lists.schedmd.com/pipermail/slurm-users/2022-December/009514.html> 
> posted this morning for a spin, I filtered for that user and noticed 
> that even though I was only looking 2 days back in the job history, 
> I was seeing a job from August.
>
> Conspicuously, the job state is CANCELLED, but the job end time is 
> exactly one year after the start time, which puts the end time in 2023.
> So something in the slurmdbd is confused about these jobs: they are 
> lingering, reported as cancelled, but somehow still “on the books” 
> until next August.
>
>> ╭──────────────────────────────────────────────────╮
>> │                                                  │
>> │  Job ID         : 290742                         │
>> │  Job Name       : $jobname                       │
>> │  User           : $user                          │
>> │  Group          : $user                          │
>> │  Job Account    : $account                       │
>> │  Job Submission : 2022-08-08 08:44:52 -0400 EDT  │
>> │  Job Start      : 2022-08-08 08:46:53 -0400 EDT  │
>> │  Job End        : 2023-08-08 08:47:01 -0400 EDT  │
>> │  Job Wait time  : 2m1s                           │
>> │  Job Run time   : 8760h0m8s                      │
>> │  Partition      : $part                          │
>> │  Priority       : 127282                         │
>> │  QoS            : $qos                           │
>> │                                                  │
>> │                                                  │
>> ╰──────────────────────────────────────────────────╯
>> Steps count: 0
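>
> For what it’s worth, sacct pulls the same record straight from the 
> accounting database and shows the same bogus end time (command 
> sketch; the -S window is needed because the jobs started back in 
> August):
>
>      sacct -j 290742 -S 2022-08-01 -E now \
>          --format=JobID,State,Start,End,Elapsed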
>
>> Filter: $user         Items: 13
>>
>>  Job ID  Job Name  Part.  QoS   Account  User   Nodes   State
>> ─────────────────────────────────────────────────────────────────
>>  290714  $jobname  $part  $qos  $acct    $user  node32  CANCELLED
>>  290716  $jobname  $part  $qos  $acct    $user  node24  CANCELLED
>>  290736  $jobname  $part  $qos  $acct    $user  node00  CANCELLED
>>  290742  $jobname  $part  $qos  $acct    $user  node01  CANCELLED
>>  290770  $jobname  $part  $qos  $acct    $user  node02  CANCELLED
>>  290777  $jobname  $part  $qos  $acct    $user  node03  CANCELLED
>>  290793  $jobname  $part  $qos  $acct    $user  node04  CANCELLED
>>  290797  $jobname  $part  $qos  $acct    $user  node05  CANCELLED
>>  290799  $jobname  $part  $qos  $acct    $user  node06  CANCELLED
>>  290801  $jobname  $part  $qos  $acct    $user  node07  CANCELLED
>>  290814  $jobname  $part  $qos  $acct    $user  node08  CANCELLED
>>  290817  $jobname  $part  $qos  $acct    $user  node09  CANCELLED
>>  290819  $jobname  $part  $qos  $acct    $user  node10  CANCELLED
>
> I’d love to figure out the proper way to either purge these job IDs 
> from the accounting database cleanly, or correct the job end/run 
> times to sane values.
> Slurm is v21.08.8-2, and NTP comes from a stratum 1 server, so time 
> is in sync everywhere; not that multiple servers would all drift a 
> year off like this anyway.
>
> Thanks for any help,
> Reed