<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Two votes for runawayjobs is a strong vote (and something I’m glad to learn exists for the future). However, it does not appear to be the case here:<div class=""><br class=""></div><div class=""><blockquote type="cite" class=""><div class=""># sacctmgr show runawayjobs</div><div class="">Runaway Jobs: No runaway jobs found on cluster $cluster</div></blockquote><div class=""><br class=""></div>So unfortunately that doesn’t appear to be the culprit.</div><div class=""><br class=""></div><div class="">Appreciate the responses.</div><div class=""><br class=""></div><div class="">Reed<br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Dec 20, 2022, at 10:03 AM, Brian Andrus &lt;<a href="mailto:toomuchit@gmail.com" class="">toomuchit@gmail.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class="">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" class="">
<div class=""><p class="">Try: <br class="">
</p><p class=""> sacctmgr list runawayjobs</p><p class="">Brian Andrus<br class="">
</p>
<div class="moz-cite-prefix">On 12/20/2022 7:54 AM, Reed Dier wrote:<br class="">
</div>
<blockquote type="cite" cite="mid:069A5B5A-CC57-46B8-9CDE-095CA83D7C83@focusvq.com" class="">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" class="">
Hoping this is a fairly simple one.
<div class=""><br class="">
</div>
<div class="">This is a small internal cluster that we’ve been
using for about 6 months now, and we’ve had some infrastructure
instability in that time, which I think may be the root cause
of this weirdness, but hopefully someone can point me in the
right direction to solve the issue.</div>
<div class=""><br class="">
</div>
<div class="">I send a daily sreport email showing how busy the
cluster was and who the top users were.</div>
<div class="">Weirdly, one user shows exactly the same usage day
after day, down to a hundredth of a percent, even while they
were on vacation and said that they had no job submissions
in cron, etc.</div>
<div class=""><br class="">
</div>
<div class="">So, taking the <a href="https://lists.schedmd.com/pipermail/slurm-users/2022-December/009514.html" class="" moz-do-not-send="true">scom TUI</a> posted this
morning for a spin, I filtered on that user and noticed that even
though I was only looking 2 days back in the job history, I was
seeing a job from August.</div>
<div class=""><br class="">
</div>
<div class="">Notably, the job state is CANCELLED, but the
job end time is one year after the start time, which puts the
end time in 2023.</div>
<div class="">So the dbd seems to be confused about these jobs:
they report as cancelled but somehow remain “on the books”
until next August.</div>
<div class=""><br class="">
</div>
<div class="">
<blockquote type="cite" class="">
<div class=""><font class="" face="Menlo">╭──────────────────────────────────────────────────────────────────────────────────────────╮</font></div>
<div class=""><font class="" face="Menlo">│
│</font></div>
<div class=""><font class="" face="Menlo">│ Job ID
: 290742
│</font></div>
<div class=""><font class="" face="Menlo">│ Job Name
: $jobname
│</font></div>
<div class=""><font class="" face="Menlo">│ User
: $user
│</font></div>
<div class=""><font class="" face="Menlo">│ Group
: $user
│</font></div>
<div class=""><font class="" face="Menlo">│ Job Account
: $account
│</font></div>
<div class=""><font class="" face="Menlo">│ Job Submission
: 2022-08-08 08:44:52 -0400 EDT
│</font></div>
<div class=""><font class="" face="Menlo">│ Job Start
: 2022-08-08 08:46:53 -0400 EDT
│</font></div>
<div class=""><font class="" face="Menlo">│ Job End
: 2023-08-08 08:47:01 -0400 EDT
│</font></div>
<div class=""><font class="" face="Menlo">│ Job Wait time
: 2m1s
│</font></div>
<div class=""><font class="" face="Menlo">│ Job Run time
: 8760h0m8s
│</font></div>
<div class=""><font class="" face="Menlo">│ Partition
: $part
│</font></div>
<div class=""><font class="" face="Menlo">│ Priority
: 127282
│</font></div>
<div class=""><font class="" face="Menlo">│ QoS
: $qos
│</font></div>
<div class=""><font class="" face="Menlo">│
│</font></div>
<div class=""><font class="" face="Menlo">│
│</font></div>
<div class=""><font class="" face="Menlo">╰──────────────────────────────────────────────────────────────────────────────────────────╯</font></div>
<div class=""><font class="" face="Menlo">Steps count: 0</font></div>
</blockquote>
<br class="">
</div>
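<div class="">As a sanity check on the numbers in the job detail above, here is a minimal Python sketch (not part of scom or Slurm; the helper names and the timestamp format string are my own assumptions) that recomputes the run time from the start and end timestamps shown for job 290742.</div>
<pre class="">
```python
from datetime import datetime, timedelta

# Timestamps copied from the job detail box above (job 290742).
# The format string is an assumption about how the TUI prints them.
FMT = "%Y-%m-%d %H:%M:%S %z"


def parse(stamp: str) -> datetime:
    # Drop the trailing zone abbreviation ("EDT") and keep the numeric offset.
    return datetime.strptime(stamp.rsplit(" ", 1)[0], FMT)


def format_run_time(delta: timedelta) -> str:
    # Render a duration in the same "8760h0m8s" style the TUI uses.
    total = int(delta.total_seconds())
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h{minutes}m{seconds}s"


start = parse("2022-08-08 08:46:53 -0400 EDT")
end = parse("2023-08-08 08:47:01 -0400 EDT")
run_time = end - start

print(format_run_time(run_time))        # 8760h0m8s
print(run_time - timedelta(days=365))   # 0:00:08
```
</pre>
<div class="">So the recorded end time is exactly 365 days plus 8 seconds after the start, which matches the reported run time of 8760h0m8s.</div>
<div class=""><br class="">
</div>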
<div class="">
<blockquote type="cite" class=""><font class="" face="Menlo">Filter:
$user Items: 13</font></blockquote>
<blockquote type="cite" class="">
<div class=""><font class="" face="Menlo"><br class="">
</font></div>
<div class=""><font class="" face="Menlo"> Job ID Job
Name Part. QoS
Account User Nodes State</font></div>
<div class=""><font class="" face="Menlo">───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────</font></div>
<div class=""><font class="" face="Menlo"> 290714
$jobname $part $qos
$acct $user node32
CANCELLED</font></div>
<div class=""><font class="" face="Menlo"> 290716
$jobname $part $qos
$acct $user node24
CANCELLED</font></div>
<div class=""><font class="" face="Menlo"> 290736
$jobname $part $qos
$acct $user node00
CANCELLED</font></div>
<div class=""><font class="" face="Menlo"> 290742
$jobname $part $qos
$acct $user node01
CANCELLED</font></div>
<div class=""><font class="" face="Menlo"> 290770
$jobname $part $qos
$acct $user node02
CANCELLED</font></div>
<div class=""><font class="" face="Menlo"> 290777
$jobname $part $qos
$acct $user node03
CANCELLED</font></div>
<div class=""><font class="" face="Menlo"> 290793
$jobname $part $qos
$acct $user node04
CANCELLED</font></div>
<div class=""><font class="" face="Menlo"> 290797
$jobname $part $qos
$acct $user node05
CANCELLED</font></div>
<div class=""><font class="" face="Menlo"> 290799
$jobname $part $qos
$acct $user node06
CANCELLED</font></div>
<div class=""><font class="" face="Menlo"> 290801
$jobname $part $qos
$acct $user node07
CANCELLED</font></div>
<div class=""><font class="" face="Menlo"> 290814
$jobname $part $qos
$acct $user node08
CANCELLED</font></div>
<div class=""><font class="" face="Menlo"> 290817
$jobname $part $qos
$acct $user node09
CANCELLED</font></div>
<div class=""><font class="" face="Menlo"> 290819
$jobname $part $qos
$acct $user node10
CANCELLED</font></div>
</blockquote>
</div>
<div class=""><br class="">
</div>
<div class="">I’d love to figure out the proper way to either
purge these job IDs from the accounting database cleanly, or
change the job end/run time to a sane/correct value.</div>
<div class="">Slurm is v21.08.8-2, and ntp is a stratum 1 server,
so time is in sync everywhere; not that multiple servers would
drift a full year off like this anyway.</div>
<div class=""><br class="">
</div>
<div class="">Thanks for any help,</div>
<div class="">Reed</div>
</blockquote>
</div>
</div></blockquote></div><br class=""></div></body></html>