[slurm-users] Job cancelled into the future
Reed Dier
reed.dier at focusvq.com
Tue Dec 20 16:08:26 UTC 2022
Two votes for runawayjobs is a strong signal (and something I’m glad to learn exists for the future). However, it does not appear to be the cause here.
> # sacctmgr show runawayjobs
> Runaway Jobs: No runaway jobs found on cluster $cluster
So unfortunately that doesn’t appear to be the culprit.
Appreciate the responses.
Reed
> On Dec 20, 2022, at 10:03 AM, Brian Andrus <toomuchit at gmail.com> wrote:
>
> Try:
>
> sacctmgr list runawayjobs
>
> Brian Andrus
>
> On 12/20/2022 7:54 AM, Reed Dier wrote:
>> Hoping this is a fairly simple one.
>>
>> This is a small internal cluster that we’ve been using for about 6 months now, and we’ve had some infrastructure instability in that time, which I think may be the root culprit behind this weirdness, but hopefully someone can point me in the direction to solve the issue.
>>
>> I do a daily email of sreport to show how busy the cluster was, and who were the top users.
>> Weirdly, I have a user who appears to rack up exactly the same usage day after day, down to a hundredth of a percent, conspicuously even when they were on vacation and claimed that they didn’t have job submissions in cron/etc.
>>
>> So then, taking a spin of the scom tui <https://lists.schedmd.com/pipermail/slurm-users/2022-December/009514.html> posted this morning, I filtered on that user, and noticed that even though I was only looking 2 days back at job history, I was seeing a job from August.
>>
>> Conspicuously, the job state is cancelled, but the job end time is 1y from the start time, meaning its job end time is in 2023.
>> So something with the dbd is confused about this/these jobs that are lingering and reporting cancelled but still “on the books” somehow until next August.
>>
>>> ╭──────────────────────────────────────────────────────────────────────────────────────────╮
>>> │ │
>>> │ Job ID : 290742 │
>>> │ Job Name : $jobname │
>>> │ User : $user │
>>> │ Group : $user │
>>> │ Job Account : $account │
>>> │ Job Submission : 2022-08-08 08:44:52 -0400 EDT │
>>> │ Job Start : 2022-08-08 08:46:53 -0400 EDT │
>>> │ Job End : 2023-08-08 08:47:01 -0400 EDT │
>>> │ Job Wait time : 2m1s │
>>> │ Job Run time : 8760h0m8s │
>>> │ Partition : $part │
>>> │ Priority : 127282 │
>>> │ QoS : $qos │
>>> │ │
>>> │ │
>>> ╰──────────────────────────────────────────────────────────────────────────────────────────╯
>>> Steps count: 0
>>
>>> Filter: $user Items: 13
>>>
>>> Job ID Job Name Part. QoS Account User Nodes State
>>> ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>>> 290714 $jobname $part $qos $acct $user node32 CANCELLED
>>> 290716 $jobname $part $qos $acct $user node24 CANCELLED
>>> 290736 $jobname $part $qos $acct $user node00 CANCELLED
>>> 290742 $jobname $part $qos $acct $user node01 CANCELLED
>>> 290770 $jobname $part $qos $acct $user node02 CANCELLED
>>> 290777 $jobname $part $qos $acct $user node03 CANCELLED
>>> 290793 $jobname $part $qos $acct $user node04 CANCELLED
>>> 290797 $jobname $part $qos $acct $user node05 CANCELLED
>>> 290799 $jobname $part $qos $acct $user node06 CANCELLED
>>> 290801 $jobname $part $qos $acct $user node07 CANCELLED
>>> 290814 $jobname $part $qos $acct $user node08 CANCELLED
>>> 290817 $jobname $part $qos $acct $user node09 CANCELLED
>>> 290819 $jobname $part $qos $acct $user node10 CANCELLED
>>
>>
>> I’d love to figure out the proper way to either purge these job IDs from the accounting database cleanly, or change the job end/run time to a sane/correct value.
>> Slurm is v21.08.8-2, and NTP is synced against a stratum 1 server, so time is in sync everywhere, not that multiple servers would drift 1 year off like this.
>>
>> Thanks for any help,
>> Reed
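
For anyone who wants to check for the same symptom (state CANCELLED, but an end time that is still in the future), here is a minimal Python sketch. It does not query slurmdbd at all: the records are hard-coded, the first one is copied from the job 290742 panel above, the second is a made-up control entry, and the function name find_future_cancelled is hypothetical. The "now" value is simply the timestamp of this message.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-4 offset, matching the "-0400 EDT" stamps in the job panel.
EDT = timezone(timedelta(hours=-4))

# Hard-coded records; in practice these would come from an accounting export.
jobs = [
    # Values taken from the job 290742 panel quoted above.
    {"id": 290742, "state": "CANCELLED",
     "start": datetime(2022, 8, 8, 8, 46, 53, tzinfo=EDT),
     "end": datetime(2023, 8, 8, 8, 47, 1, tzinfo=EDT)},
    # Hypothetical control record (not from the thread): a sanely cancelled job.
    {"id": 1, "state": "CANCELLED",
     "start": datetime(2022, 8, 8, 9, 0, 0, tzinfo=EDT),
     "end": datetime(2022, 8, 8, 9, 5, 0, tzinfo=EDT)},
]


def find_future_cancelled(records, now):
    """Return CANCELLED records whose end time is later than `now`."""
    return [r for r in records if r["state"] == "CANCELLED" and r["end"] > now]


# Reference time: the date of this message (Tue Dec 20 16:08:26 UTC 2022).
now = datetime(2022, 12, 20, 16, 8, 26, tzinfo=timezone.utc)

suspects = find_future_cancelled(jobs, now)
for job in suspects:
    run_time = job["end"] - job["start"]
    # For 290742 this is 365 days + 8 s, i.e. the 8760h0m8s shown in the panel.
    print(job["id"], run_time)
```

Only the 290742 record is flagged; the control record ends before the reference time and is skipped.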