[slurm-users] Forcibly end "zombie" jobs?

Doug Meyer dameyer99 at gmail.com
Thu Jan 9 00:57:32 UTC 2020


Totally agree with the solution.  We ran Slurm 15.xx for some time, and
editing the jobs manually was miserable.  The `sacctmgr show runawayjobs`
command was added in Slurm 16.  I have used it often and been pleased with it.
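
As a complement to `sacctmgr show runawayjobs`, here is a minimal sketch (not
a tested tool) for spotting candidate jobs from `sacct` output before fixing
anything.  It assumes pipe-separated lines such as those produced by
`sacct -a -X -n -P -o JobID,Start,End,State -S <date>`, and that a missing
end time is displayed as `Unknown`.  The function name, the 7-day threshold,
and the sample data are made up for illustration.

```python
from datetime import datetime, timedelta


def find_stale_jobs(sacct_lines, now, max_age=timedelta(days=7)):
    """Return job IDs with no recorded end time that started long ago.

    Each line is expected to look like 'JobID|Start|End|State'
    (assumed layout; adjust to the fields you actually request).
    """
    stale = []
    for line in sacct_lines:
        fields = line.strip().split("|")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        job_id, start, end, _state = fields[:4]
        if end != "Unknown":
            continue  # job has an end timestamp, nothing to do
        try:
            started = datetime.strptime(start, "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # e.g. Start is also "Unknown" (job never started)
        if now - started > max_age:
            stale.append(job_id)
    return stale


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real cluster.
    sample = [
        "101|2019-10-25T10:00:00|Unknown|COMPLETED",
        "102|2020-01-07T09:00:00|Unknown|RUNNING",
        "103|2019-10-25T10:05:00|2019-10-25T11:00:00|COMPLETED",
    ]
    print(find_stale_jobs(sample, now=datetime(2020, 1, 8)))
```

Choose `max_age` to be longer than the longest run time any partition
allows, so that legitimately running jobs are not flagged.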

Doug

On Wed, Jan 8, 2020 at 7:40 AM Douglas Jacobsen <dmjacobsen at lbl.gov> wrote:

> Try running `sacctmgr show runawayjobs`; it should give you the list of
> running/pending jobs (from slurmdbd's perspective) that are unknown to
> slurmctld.  It will give you the option to "fix" them; note, however, that
> fixing sets the end time of each job to its start time, so the accounting
> for those jobs will be inaccurate, and it will re-roll (resummarize)
> accounting statistics back to that point in time.  If you fix a pending
> job, some versions of Slurm set that re-roll time to 0, so it would re-roll
> all accounting activity.
>
> In some cases we've instead chosen to edit the start/end times of these
> runaway jobs directly in the job table of the database in order to
> maintain appropriate accounting; however, that is fraught with risk as well
> (and unless well timed, it can make it challenging to re-roll the
> statistics well).
>
> These events often trace back to a crash of slurmctld in which some
> messages were never received by slurmdbd.
> ----
> Doug Jacobsen, Ph.D.
> NERSC Senior Computing Engineer
> Group Lead, Computational Systems Group
> National Energy Research Scientific Computing Center
> dmjacobsen at lbl.gov
>
> ------------- __o
> ---------- _ '\<,_
> ----------(_)/  (_)__________________________
>
>
> On Wed, Jan 8, 2020 at 6:24 AM Steffen Grunewald <
> steffen.grunewald at aei.mpg.de> wrote:
>
>> Good afternoon everyone,
>>
>> when trying to collect some accounting information from my Slurm cluster,
>> I found a couple of jobs that haven't been recorded as "finished", and
>> therefore show up in every single day's accounting since their start date.
>>
>> All those jobs were started within a 2-hour period, 75 days ago.
>> There's no partition allowing run times that long.
>>
>> I can extract the job IDs; `sacct -j $id` returns "COMPLETED".
>>
>> Is there a means to force an end timestamp into the database for only
>> these jobs (START matches the specific date, there's no end date)?
>>
>> Thanks for any suggestion.
>>
>> - Steffen
>>
>> --
>> Steffen Grunewald, Cluster Administrator
>> Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
>> Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
>> ~~~
>> Fon: +49-331-567 7274
>> Mail: steffen.grunewald(at)aei.mpg.de
>> ~~~
>>
>>