<div dir="ltr"><div>Totally agree with the solution. We ran Slurm 15.xx for some time, and editing jobs manually was miserable. The command mentioned below was added in Slurm 16. I have used it often and been pleased with it.<br></div><div><br></div><div>Doug<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jan 8, 2020 at 7:40 AM Douglas Jacobsen <<a href="mailto:dmjacobsen@lbl.gov">dmjacobsen@lbl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Try running `sacctmgr show runawayjobs`; it should list the jobs that slurmdbd considers running or pending but that are unknown to slurmctld. It will offer to "fix" them; note, however, that fixing sets each job's end time to its start time, so the accounting for those jobs will be wrong, and it will re-roll (resummarize) accounting statistics back to that point in time. If you fix a pending job, some versions of Slurm set that re-roll time to 0, so all accounting activity would be re-rolled.<div><br></div><div>In some cases we have instead chosen to edit the start/end times of these runaway jobs directly in the jobs_table of the database in order to keep the accounting accurate; however, that is fraught with risk as well (and unless it is well timed, it can make re-rolling the statistics difficult).</div><div><br></div><div>These events often trace back to a slurmctld crash in which some messages never reached slurmdbd.<br clear="all"><div><div dir="ltr">----<br>Doug Jacobsen, Ph.D.<br>NERSC Senior Computing Engineer<br>Group Lead, Computational Systems Group<br>National Energy Research Scientific Computing Center<br><a href="mailto:dmjacobsen@lbl.gov" target="_blank">dmjacobsen@lbl.gov</a><br><br>------------- __o<br>---------- _ '\<,_<br>----------(_)/  (_)__________________________<br></div></div><br></div></div><br><div 
class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jan 8, 2020 at 6:24 AM Steffen Grunewald <<a href="mailto:steffen.grunewald@aei.mpg.de" target="_blank">steffen.grunewald@aei.mpg.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Good afternoon everyone,<br>

<br>

when trying to collect some accounting information from my Slurm cluster,<br>

I found a couple of jobs that haven't been recorded as "finished", and<br>

therefore show up in every single day's accounting since their start date.<br>

<br>

All those jobs were started within a 2-hour period, 75 days ago.<br>

There's no partition allowing run times that long.<br>

<br>

I can extract the job IDs; "sacct -j $id" returns "COMPLETED".<br>

<br>

Is there a way to force an end timestamp into the database for only<br>

these jobs (START matches the specific date, there's no end date)?<br>

<br>

Thanks for any suggestion.<br>

<br>

- Steffen<br>

<br>

--<br>

Steffen Grunewald, Cluster Administrator<br>

Max Planck Institute for Gravitational Physics (Albert Einstein Institute)<br>

Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany<br>

~~~<br>

Phone: +49-331-567 7274<br>

Mail: steffen.grunewald(at)<a href="http://aei.mpg.de" rel="noreferrer" target="_blank">aei.mpg.de</a><br>

~~~<br>

<br>

</blockquote></div>

</blockquote></div>