[slurm-users] Job cancelled into the future
Reed Dier
reed.dier at focusvq.com
Tue Dec 20 19:51:14 UTC 2022
Just to follow up with some things I’ve tried:
scancel doesn’t want to touch it:
> # scancel -v 290710
> scancel: Terminating job 290710
> scancel: error: Kill job error on job id 290710: Job/step already completing or completed
scontrol does see that these are all members of the same array, but doesn’t want to touch them:
> # scontrol update JobID=290710 EndTime=2022-08-09T08:47:01
> 290710_4,6,26,32,60,67,83,87,89,91,...: Job has already finished
And trying to modify the job’s end time with sacctmgr fails as expected, because EndTime is only a where specifier, not a set specifier (I also tried EndTime=now, with the same result):
> # sacctmgr modify job where JobID=290710 set EndTime=2022-08-09T08:47:01
> Unknown option: EndTime=2022-08-09T08:47:01
> Use keyword 'where' to modify condition
> You didn't give me anything to set
I was able to set a comment for the jobs/array, so the DBD can see/talk to them.
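For reference, setting the comment went through with something along these lines (the comment text is just a marker I picked):
> # sacctmgr modify job where JobID=290710 set Comment="stuck-cancelled-2023"
So the records are modifiable in general, just not their end times.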
One additional thing to mention: there are 14 JobIDs stuck like this, 1 being the parent array JobID and the other 13 being array tasks under that original array ID.
But I figured I would share the other steps I’ve tried, to rule those ideas out.
Thanks,
Reed
> On Dec 20, 2022, at 10:08 AM, Reed Dier <reed.dier at focusvq.com> wrote:
>
> Two votes for runawayjobs is a strong vote (and also something I’m glad to learn exists for the future); however, it does not appear to be the case here.
>
>> # sacctmgr show runawayjobs
>> Runaway Jobs: No runaway jobs found on cluster $cluster
>
> So unfortunately that doesn’t appear to be the culprit.
>
> Appreciate the responses.
>
> Reed
>
>> On Dec 20, 2022, at 10:03 AM, Brian Andrus <toomuchit at gmail.com> wrote:
>>
>> Try:
>>
>> sacctmgr list runawayjobs
>>
>> Brian Andrus
>>
>> On 12/20/2022 7:54 AM, Reed Dier wrote:
>>> Hoping this is a fairly simple one.
>>>
>>> This is a small internal cluster that we’ve been using for about 6 months now, and we’ve had some infrastructure instability in that time, which I think may be the root cause of this weirdness; hopefully someone can point me in the right direction to solve the issue.
>>>
>>> I send a daily sreport email showing how busy the cluster was and who the top users were.
>>> Weirdly, I have a user whose usage is exactly the same day after day after day, down to the hundredth of a percent, conspicuously even when they were on vacation and claimed they had no job submissions in cron/etc.
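>>> For context, the daily report comes from something along these lines (dates and flags are illustrative):
>>>> # sreport cluster UserUtilizationByAccount start=2022-12-19 end=2022-12-20 -t percent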
>>>
>>> So then, taking the scom tui <https://lists.schedmd.com/pipermail/slurm-users/2022-December/009514.html> posted this morning for a spin, I filtered on that user and noticed that even though I was only looking 2 days back at job history, I was seeing a job from August.
>>>
>>> Conspicuously, the job state is CANCELLED, but the job end time is exactly one year after the start time, meaning its end time is in 2023.
>>> So something in the dbd is confused about these lingering jobs: they report as cancelled, but somehow remain “on the books” until next August.
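>>> The same bogus end time should be visible straight from sacct, independent of scom, with something like:
>>>> # sacct -j 290742 -o JobID,State,Start,End,Elapsed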
>>>
>>>> ╭──────────────────────────────────────────────────────────────────────────────────────────╮
>>>> │ │
>>>> │ Job ID : 290742 │
>>>> │ Job Name : $jobname │
>>>> │ User : $user │
>>>> │ Group : $user │
>>>> │ Job Account : $account │
>>>> │ Job Submission : 2022-08-08 08:44:52 -0400 EDT │
>>>> │ Job Start : 2022-08-08 08:46:53 -0400 EDT │
>>>> │ Job End : 2023-08-08 08:47:01 -0400 EDT │
>>>> │ Job Wait time : 2m1s │
>>>> │ Job Run time : 8760h0m8s │
>>>> │ Partition : $part │
>>>> │ Priority : 127282 │
>>>> │ QoS : $qos │
>>>> │ │
>>>> │ │
>>>> ╰──────────────────────────────────────────────────────────────────────────────────────────╯
>>>> Steps count: 0
>>>
>>>> Filter: $user Items: 13
>>>>
>>>> Job ID Job Name Part. QoS Account User Nodes State
>>>> ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>>>> 290714 $jobname $part $qos $acct $user node32 CANCELLED
>>>> 290716 $jobname $part $qos $acct $user node24 CANCELLED
>>>> 290736 $jobname $part $qos $acct $user node00 CANCELLED
>>>> 290742 $jobname $part $qos $acct $user node01 CANCELLED
>>>> 290770 $jobname $part $qos $acct $user node02 CANCELLED
>>>> 290777 $jobname $part $qos $acct $user node03 CANCELLED
>>>> 290793 $jobname $part $qos $acct $user node04 CANCELLED
>>>> 290797 $jobname $part $qos $acct $user node05 CANCELLED
>>>> 290799 $jobname $part $qos $acct $user node06 CANCELLED
>>>> 290801 $jobname $part $qos $acct $user node07 CANCELLED
>>>> 290814 $jobname $part $qos $acct $user node08 CANCELLED
>>>> 290817 $jobname $part $qos $acct $user node09 CANCELLED
>>>> 290819 $jobname $part $qos $acct $user node10 CANCELLED
>>>
>>>
>>> I’d love to figure out the proper way to either cleanly purge these JobIDs from the accounting database, or change the job end/run time to a sane, correct value.
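>>> If direct surgery on the accounting database is what it takes, I’m guessing (untested, and assuming the default MySQL schema; I’d stop slurmdbd and take a backup first) it would be something like:
>>>> # mysql slurm_acct_db -e "UPDATE ${cluster}_job_table \
>>>>     SET time_end = UNIX_TIMESTAMP('2022-08-08 08:47:01') \
>>>>     WHERE id_job = 290742;"
>>> But I’d much rather hear the sanctioned way to do this.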
>>> Slurm is v21.08.8-2, and NTP syncs against a stratum 1 server, so time is in sync everywhere; not that multiple servers would all drift exactly one year like this anyway.
>>>
>>> Thanks for any help,
>>> Reed
>