[slurm-users] Job cancelled into the future
Brian Andrus
toomuchit at gmail.com
Tue Dec 20 16:03:44 UTC 2022
Try:
sacctmgr list runawayjobs
Brian Andrus
On 12/20/2022 7:54 AM, Reed Dier wrote:
> Hoping this is a fairly simple one.
>
> This is a small internal cluster that we’ve been using for about 6
> months now, and we’ve had some infrastructure instability in that
> time, which I think may be the root culprit behind this weirdness, but
> hopefully someone can point me in the direction to solve the issue.
>
> I send out a daily sreport email showing how busy the cluster was and
> who the top users were.
> Weirdly, one user shows exactly the same usage day after day, down to
> a hundredth of a percent, conspicuously even while they were on
> vacation and said they had no job submissions in cron, etc.
>
> So, taking the scom TUI
> <https://lists.schedmd.com/pipermail/slurm-users/2022-December/009514.html>
> posted this morning for a spin, I filtered on that user and noticed
> that even though I was only looking 2 days back at job history, I was
> seeing a job from August.
>
> Conspicuously, the job state is CANCELLED, but the job end time is one
> year after the start time, which puts it in August 2023.
> So slurmdbd seems to be confused about these jobs: they report as
> cancelled but still linger “on the books” somehow until next August.
>
>> ╭──────────────────────────────────────────────────────────────────────────────────────────╮
>> │ │
>> │ Job ID : 290742 │
>> │ Job Name : $jobname │
>> │ User : $user │
>> │ Group : $user │
>> │ Job Account : $account │
>> │ Job Submission : 2022-08-08 08:44:52 -0400 EDT │
>> │ Job Start : 2022-08-08 08:46:53 -0400 EDT │
>> │ Job End : 2023-08-08 08:47:01 -0400 EDT │
>> │ Job Wait time : 2m1s │
>> │ Job Run time : 8760h0m8s │
>> │ Partition : $part │
>> │ Priority : 127282 │
>> │ QoS : $qos │
>> │ │
>> │ │
>> ╰──────────────────────────────────────────────────────────────────────────────────────────╯
>> Steps count: 0
>
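As a quick sanity check on the record above: the reported run time of 8760h0m8s is exactly 365 days plus 8 seconds, which is why the end time lands one year after the start. A minimal Python sketch of that arithmetic, using only the timestamps shown in the scom output (the variable names are illustrative, and the fixed -04:00 offset is taken from the displayed EDT times):

```python
from datetime import datetime, timedelta, timezone

# Values copied from the scom record for job 290742; names are illustrative.
EDT = timezone(timedelta(hours=-4))
job_start = datetime(2022, 8, 8, 8, 46, 53, tzinfo=EDT)
job_end = datetime(2023, 8, 8, 8, 47, 1, tzinfo=EDT)
reported_run_time = timedelta(hours=8760, seconds=8)  # "8760h0m8s"

run_time = job_end - job_start
one_year = timedelta(days=365)

# The recorded end time is consistent with the reported run time ...
print(run_time == reported_run_time)
# ... and the job itself accounts for only the seconds left over after
# subtracting the spurious 365 days.
print(run_time - one_year)
```

The leftover 8 seconds suggests the job was cancelled almost immediately after it started, with a full year added to its end time in the accounting record.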
>> Filter: $user Items: 13
>>
>> Job ID  Job Name  Part.  QoS   Account  User   Nodes   State
>> ────────────────────────────────────────────────────────────────
>> 290714  $jobname  $part  $qos  $acct    $user  node32  CANCELLED
>> 290716  $jobname  $part  $qos  $acct    $user  node24  CANCELLED
>> 290736  $jobname  $part  $qos  $acct    $user  node00  CANCELLED
>> 290742  $jobname  $part  $qos  $acct    $user  node01  CANCELLED
>> 290770  $jobname  $part  $qos  $acct    $user  node02  CANCELLED
>> 290777  $jobname  $part  $qos  $acct    $user  node03  CANCELLED
>> 290793  $jobname  $part  $qos  $acct    $user  node04  CANCELLED
>> 290797  $jobname  $part  $qos  $acct    $user  node05  CANCELLED
>> 290799  $jobname  $part  $qos  $acct    $user  node06  CANCELLED
>> 290801  $jobname  $part  $qos  $acct    $user  node07  CANCELLED
>> 290814  $jobname  $part  $qos  $acct    $user  node08  CANCELLED
>> 290817  $jobname  $part  $qos  $acct    $user  node09  CANCELLED
>> 290819  $jobname  $part  $qos  $acct    $user  node10  CANCELLED
>
> I’d love to figure out the proper way to either purge these job IDs
> from the accounting database cleanly, or change the job end/run time
> to a sane/correct value.
> Slurm is v21.08.8-2, and NTP is served by a stratum 1 server, so time
> is in sync everywhere; not that multiple servers would drift a full
> year off like this anyway.
>
> Thanks for any help,
> Reed
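One way to find every record with this symptom, rather than spotting them by eye in a TUI, is to scan accounting output for jobs whose recorded end time is still in the future. The sketch below is only an illustration under assumptions: it assumes pipe-separated text such as might come from `sacct -a -X -n -P -S 2022-08-01 -o JobID,State,Start,End`, with timestamps in the `YYYY-MM-DDTHH:MM:SS` form. The function name `find_future_ended_jobs` and the second sample row are hypothetical, and the first sample row is job 290742 from the message above, rewritten into that assumed format.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed timestamp layout of the input


def find_future_ended_jobs(sacct_text, now):
    """Return (job_id, state, end) for jobs whose recorded End is after `now`.

    Hypothetical helper: expects lines of the form JobID|State|Start|End.
    Lines with an unparseable End (for example "Unknown" on a job that is
    still running) are skipped.
    """
    suspicious = []
    for line in sacct_text.splitlines():
        fields = line.strip().split("|")
        if len(fields) != 4:
            continue  # blank or malformed line
        job_id, state, _start, end_text = fields
        try:
            end = datetime.strptime(end_text, TIME_FORMAT)
        except ValueError:
            continue  # no concrete end time recorded
        if end > now:
            suspicious.append((job_id, state, end))
    return suspicious


# Sample input: the first row mirrors job 290742 from the thread; the
# second row is a made-up normal job for contrast.
sample = (
    "290742|CANCELLED|2022-08-08T08:46:53|2023-08-08T08:47:01\n"
    "999999|COMPLETED|2022-12-19T10:00:00|2022-12-19T11:00:00\n"
)
now = datetime(2022, 12, 20, 12, 0, 0)

for job_id, state, end in find_future_ended_jobs(sample, now):
    print(job_id, state, end.isoformat())
```

This only reports candidates. It does not modify the accounting database, so any actual fix would still go through Slurm's own tools, such as the `sacctmgr` command suggested in the reply.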