<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>It looks like the clock on the db server may have been off when
those records were inserted/updated.</p>
<p>You may want to dump the database, find which tables/records need
to be updated, and try updating them. If anything goes south, you can
restore from the dump.</p>
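<p>As a rough sketch of the find-and-update step (not a tested
procedure): the Python below uses an in-memory SQLite table as a
stand-in for the accounting database, which is really MySQL/MariaDB
behind slurmdbd. The table name job_table and the columns id_job,
time_start and time_end are assumptions for illustration, so check
the real schema in your dump first. It picks out records whose end
time is still in the future and sets them to the end time from the
scontrol attempt quoted below.</p>
<pre>
```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical stand-in for the accounting job table. A real slurmdbd
# setup uses MySQL/MariaDB; table and column names here are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_table (id_job INTEGER, time_start INTEGER, time_end INTEGER)"
)


def epoch(text: str) -> int:
    """Convert 'YYYY-MM-DD HH:MM:SS' (treated as UTC for simplicity) to epoch seconds."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


rows = [
    (290742, epoch("2022-08-08 08:46:53"), epoch("2023-08-08 08:47:01")),  # suspect
    (290900, epoch("2022-12-19 10:00:00"), epoch("2022-12-19 11:00:00")),  # made-up sane job
]
conn.executemany("INSERT INTO job_table VALUES (?, ?, ?)", rows)

# "Now" is pinned to the date of this thread so the example is repeatable.
now = epoch("2022-12-20 12:00:00")

# Step 1: find records whose end time lies in the future.
suspect = [
    row[0]
    for row in conn.execute(
        "SELECT id_job FROM job_table WHERE time_end > ? ORDER BY id_job", (now,)
    )
]

# Step 2: set a corrected end time on just those records.
fixed_end = epoch("2022-08-09 08:47:01")
conn.executemany(
    "UPDATE job_table SET time_end = ? WHERE id_job = ?",
    [(fixed_end, job_id) for job_id in suspect],
)
conn.commit()

# Step 3: confirm nothing is left with a future end time.
remaining = conn.execute(
    "SELECT COUNT(*) FROM job_table WHERE time_end > ?", (now,)
).fetchone()[0]
print(suspect, remaining)
```
</pre>
<p>On the real database, take the dump before running any UPDATE so
you can restore if needed.</p>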
<p>Brian Andrus<br>
</p>
<div class="moz-cite-prefix">On 12/20/2022 11:51 AM, Reed Dier
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:A889B640-1F36-4DDE-9603-B366ACCCD1E6@focusvq.com">
Just to follow up with some things I’ve tried:
<div class=""><br class="">
</div>
<div class="">scancel doesn’t want to touch it:</div>
<div class="">
<blockquote type="cite" class="">
<div class=""><font class="" face="Menlo"># scancel -v 290710</font></div>
<div class=""><font class="" face="Menlo">scancel: Terminating
job 290710</font></div>
<div class=""><font class="" face="Menlo">scancel: error: Kill
job error on job id 290710: Job/step already completing or
completed</font></div>
</blockquote>
<div><br class="">
</div>
<div>scontrol does see that these are all members of the same
array, but doesn’t want to touch it either:</div>
<div>
<blockquote type="cite" class="">
<div><font class="" face="Menlo"># scontrol update
JobID=290710 EndTime=2022-08-09T08:47:01</font></div>
<div><font class="" face="Menlo">290710_4,6,26,32,60,67,83,87,89,91,...:
Job has already finished</font></div>
</blockquote>
<br class="">
</div>
<div>And trying to modify the job’s end time with sacctmgr
fails, as expected, because EndTime is only a where spec, not
a set spec. I also tried EndTime=now with the same results:</div>
<div>
<blockquote type="cite" class="">
<div><font class="" face="Menlo"># sacctmgr modify job where
JobID=290710 set EndTime=2022-08-09T08:47:01</font></div>
<div><font class="" face="Menlo"> Unknown option:
EndTime=2022-08-09T08:47:01</font></div>
<div><font class="" face="Menlo"> Use keyword 'where' to
modify condition</font></div>
<div><font class="" face="Menlo"> You didn't give me
anything to set</font></div>
</blockquote>
<br class="">
</div>
<div>I was able to set a comment for the jobs/array, so the DBD
can see/talk to them.</div>
<div>One additional thing to mention is that there are 14 JIDs
stuck like this: 1 is an array JID, and 13 are array tasks
of that original array ID.</div>
<div><br class="">
</div>
<div>But I figured I would provide some of the other steps I’ve
tried, to rule those ideas out.</div>
<div><br class="">
</div>
<div>Thanks,</div>
<div>Reed</div>
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Dec 20, 2022, at 10:08 AM, Reed Dier <<a
href="mailto:reed.dier@focusvq.com"
class="moz-txt-link-freetext" moz-do-not-send="true">reed.dier@focusvq.com</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode:
space; line-break: after-white-space;" class="">Two votes
for runawayjobs is a strong vote (and also something I’m
glad to learn exists for the future). However, it does
not appear to be the case.
<div class=""><br class="">
</div>
<div class="">
<blockquote type="cite" class="">
<div class=""># sacctmgr show runawayjobs</div>
<div class="">Runaway Jobs: No runaway jobs found on
cluster $cluster</div>
</blockquote>
<div class=""><br class="">
</div>
So unfortunately that doesn’t appear to be the
culprit.</div>
<div class=""><br class="">
</div>
<div class="">Appreciate the responses.</div>
<div class=""><br class="">
</div>
<div class="">Reed<br class="">
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Dec 20, 2022, at 10:03 AM, Brian
Andrus <<a href="mailto:toomuchit@gmail.com"
class="moz-txt-link-freetext"
moz-do-not-send="true">toomuchit@gmail.com</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class="">
<p class="">Try: <br class="">
</p>
<p class=""> sacctmgr list runawayjobs</p>
<p class="">Brian Andrus<br class="">
</p>
<div class="moz-cite-prefix">On 12/20/2022
7:54 AM, Reed Dier wrote:<br class="">
</div>
<blockquote type="cite"
cite="mid:069A5B5A-CC57-46B8-9CDE-095CA83D7C83@focusvq.com"
class="">
Hoping this is a fairly simple one.
<div class=""><br class="">
</div>
<div class="">This is a small internal
cluster that we’ve been using for about 6
months now, and we’ve had some
infrastructure instability in that time,
which I think may be the root culprit
behind this weirdness, but hopefully
someone can point me in the direction to
solve the issue.</div>
<div class=""><br class="">
</div>
<div class="">I do a daily email of sreport
to show how busy the cluster was, and who
were the top users.</div>
<div class="">Weirdly, I have a user that
seems to be able to use the same exact
usage day after day after day, down to
hundredth of a percent, conspicuously even
when they were on vacation and claimed
that they didn’t have job submissions in
cron/etc.</div>
<div class=""><br class="">
</div>
<div class="">So then, taking a spin of the <a
href="https://lists.schedmd.com/pipermail/slurm-users/2022-December/009514.html"
class="" moz-do-not-send="true">scom
tui </a>posted this morning, I then
filtered that user, and noticed that even
though I was only looking 2 days back at
job history, I was seeing a job from
August.</div>
<div class=""><br class="">
</div>
<div class="">Conspicuously, the job state
is cancelled, but the job end time is 1y
from the start time, meaning its job end
time is in 2023.</div>
<div class="">So something with the dbd is
confused about this/these jobs that are
lingering and reporting cancelled but
still “on the books” somehow until next
August.</div>
<div class=""><br class="">
</div>
<div class="">
<blockquote type="cite" class="">
<div class=""><font class=""
face="Menlo">╭──────────────────────────────────────────────────────────────────────────────────────────╮</font></div>
<div class=""><font class=""
face="Menlo">│
│</font></div>
<div class=""><font class=""
face="Menlo">│ Job ID
: 290742
│</font></div>
<div class=""><font class=""
face="Menlo">│ Job Name
: $jobname
│</font></div>
<div class=""><font class=""
face="Menlo">│ User
: $user
│</font></div>
<div class=""><font class=""
face="Menlo">│ Group
: $user
│</font></div>
<div class=""><font class=""
face="Menlo">│ Job Account
: $account
│</font></div>
<div class=""><font class=""
face="Menlo">│ Job Submission
: 2022-08-08 08:44:52 -0400 EDT
│</font></div>
<div class=""><font class=""
face="Menlo">│ Job Start
: 2022-08-08 08:46:53 -0400 EDT
│</font></div>
<div class=""><font class=""
face="Menlo">│ Job End
: 2023-08-08 08:47:01 -0400 EDT
│</font></div>
<div class=""><font class=""
face="Menlo">│ Job Wait time
: 2m1s
│</font></div>
<div class=""><font class=""
face="Menlo">│ Job Run time
: 8760h0m8s
│</font></div>
<div class=""><font class=""
face="Menlo">│ Partition
: $part
│</font></div>
<div class=""><font class=""
face="Menlo">│ Priority
: 127282
│</font></div>
<div class=""><font class=""
face="Menlo">│ QoS
: $qos
│</font></div>
<div class=""><font class=""
face="Menlo">│
│</font></div>
<div class=""><font class=""
face="Menlo">│
│</font></div>
<div class=""><font class=""
face="Menlo">╰──────────────────────────────────────────────────────────────────────────────────────────╯</font></div>
<div class=""><font class=""
face="Menlo">Steps count: 0</font></div>
</blockquote>
<br class="">
</div>
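<div class="">As a quick sanity check on the numbers in that
box (just a sketch using the start and end times shown, which
share the same UTC offset), the gap works out to exactly 365
days plus 8 seconds, matching the reported run time:</div>
<pre>
```python
from datetime import datetime

# Start and end times copied from the job detail box above.
start = datetime.fromisoformat("2022-08-08 08:46:53")
end = datetime.fromisoformat("2023-08-08 08:47:01")

elapsed = end - start

# Break the total seconds into hours, minutes, and seconds.
hours, rem = divmod(int(elapsed.total_seconds()), 3600)
minutes, seconds = divmod(rem, 60)
run_time = f"{hours}h{minutes}m{seconds}s"

print(elapsed.days, run_time)  # prints: 365 8760h0m8s
```
</pre>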
<div class="">
<blockquote type="cite" class=""><font
class="" face="Menlo">Filter: $user
Items: 13</font></blockquote>
<blockquote type="cite" class="">
<div class=""><font class=""
face="Menlo"><br class="">
</font></div>
<div class=""><font class=""
face="Menlo"> Job ID Job Name
Part. QoS
Account User
Nodes State</font></div>
<div class=""><font class=""
face="Menlo">───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────</font></div>
<div class=""><font class=""
face="Menlo"> 290714 $jobname
$part
$qos $acct $user
node32 CANCELLED</font></div>
<div class=""><font class=""
face="Menlo"> 290716 $jobname
$part
$qos $acct $user
node24 CANCELLED</font></div>
<div class=""><font class=""
face="Menlo"> 290736 $jobname
$part
$qos $acct $user
node00 CANCELLED</font></div>
<div class=""><font class=""
face="Menlo"> 290742 $jobname
$part
$qos $acct $user
node01 CANCELLED</font></div>
<div class=""><font class=""
face="Menlo"> 290770 $jobname
$part
$qos $acct $user
node02 CANCELLED</font></div>
<div class=""><font class=""
face="Menlo"> 290777 $jobname
$part
$qos $acct $user
node03 CANCELLED</font></div>
<div class=""><font class=""
face="Menlo"> 290793 $jobname
$part
$qos $acct $user
node04 CANCELLED</font></div>
<div class=""><font class=""
face="Menlo"> 290797 $jobname
$part
$qos $acct $user
node05 CANCELLED</font></div>
<div class=""><font class=""
face="Menlo"> 290799 $jobname
$part
$qos $acct $user
node06 CANCELLED</font></div>
<div class=""><font class=""
face="Menlo"> 290801 $jobname
$part
$qos $acct $user
node07 CANCELLED</font></div>
<div class=""><font class=""
face="Menlo"> 290814 $jobname
$part
$qos $acct $user
node08 CANCELLED</font></div>
<div class=""><font class=""
face="Menlo"> 290817 $jobname
$part
$qos $acct $user
node09 CANCELLED</font></div>
<div class=""><font class=""
face="Menlo"> 290819 $jobname
$part
$qos $acct $user
node10 CANCELLED</font></div>
</blockquote>
</div>
<div class=""><br class="">
</div>
<div class="">I’d love to figure out the
proper way to either purge these jid’s
from the accounting database cleanly, or
change the job end/run time to a
sane/correct value.</div>
<div class="">Slurm is v21.08.8-2, and ntp
is a stratum 1 server, so time is in sync
everywhere, not that multiple servers
would drift 1 year off like this.</div>
<div class=""><br class="">
</div>
<div class="">Thanks for any help,</div>
<div class="">Reed</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</blockquote>
</body>
</html>