[slurm-users] Forcibly end "zombie" jobs?

Douglas Jacobsen dmjacobsen at lbl.gov
Wed Jan 8 14:38:32 UTC 2020


Try running `sacctmgr show runawayjobs`; it should list the jobs that
slurmdbd still considers running or pending but that are unknown to
slurmctld.  It will offer to "fix" them, but note that fixing sets each
job's end time to its start time, so the accounting for those jobs will be
wrong, and it will re-roll (resummarize) the accounting statistics back to
that point in time.  If you fix a pending job, some versions of Slurm set
that re-roll time to 0, so it would re-roll all accounting activity.
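
Before letting sacctmgr fix anything, it can help to see which jobs look
stale from the accounting side.  The following is only a rough sketch, not
what sacctmgr does internally: it assumes you have saved the output of
something like `sacct -a -X -n -P --format=JobID,Start,End,State` for the
period in question, and that jobs with no recorded end show "Unknown" in
the End column.  The function name, the sample rows and the 7-day limit are
made up for illustration.

```python
from datetime import datetime, timedelta

def find_stale_jobs(sacct_text, now, max_runtime):
    """Return (job_id, start) for jobs with no end time whose start is
    older than max_runtime, given pipe-separated JobID|Start|End|State rows."""
    stale = []
    for line in sacct_text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, start, end, state = line.split("|")
        if end != "Unknown":
            continue  # the job has an end time, so accounting is closed
        try:
            started = datetime.strptime(start, "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # no usable start time (e.g. a job that never started)
        if now - started > max_runtime:
            stale.append((job_id, started))
    return stale

# Hypothetical sample rows, not real accounting data.
sample = """\
1001|2019-10-25T09:12:44|Unknown|RUNNING
1002|2019-10-25T10:03:10|2019-10-25T11:00:00|COMPLETED
1003|2020-01-08T08:00:00|Unknown|RUNNING
"""

# Assumed limit: no partition allows jobs longer than 7 days.
stale = find_stale_jobs(sample, datetime(2020, 1, 8, 14, 0, 0), timedelta(days=7))
for job_id, started in stale:
    print(job_id, started.isoformat())
```

Anything this flags is only a candidate; `sacctmgr show runawayjobs` remains
the authority on what slurmdbd and slurmctld actually disagree about.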

In some cases we've instead chosen to edit the start/end times of these
runaway jobs by hand, directly in the database's job table, in order to
keep the accounting accurate.  That is fraught with risk as well, though
(and unless it is well timed, it can make it challenging to re-roll the
statistics cleanly).

These events often trace back to a crash of the slurmctld where some
messages did not get received by the slurmdbd.
----
Doug Jacobsen, Ph.D.
NERSC Senior Computing Engineer
Group Lead, Computational Systems Group
National Energy Research Scientific Computing Center
dmjacobsen at lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________


On Wed, Jan 8, 2020 at 6:24 AM Steffen Grunewald <
steffen.grunewald at aei.mpg.de> wrote:

> Good afternoon everyone,
>
> when trying to collect some accounting information from my Slurm cluster,
> I found a couple of jobs that haven't been recorded as "finished", and
> therefore show up in every single day's accounting since their start date.
>
> All those jobs have been started within a 2-hour period, 75 days ago.
> There's no partition allowing run times that long.
>
> I can extract the job ids, "sacct -j $id" returns "COMPLETED".
>
> Is there a means to force an end timestamp into the database for only
> these jobs (START matches the specific date, there's no end date)?
>
> Thanks for any suggestion.
>
> - Steffen
>
> --
> Steffen Grunewald, Cluster Administrator
> Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
> Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
> ~~~
> Fon: +49-331-567 7274
> Mail: steffen.grunewald(at)aei.mpg.de
> ~~~
>
>