[slurm-users] Solved, Re: Forcibly end "zombie" jobs?

Steffen Grunewald steffen.grunewald at aei.mpg.de
Fri Jan 10 13:35:26 UTC 2020


Hi Doug,


On Wed, 2020-01-08 at 06:38:32 -0800, Douglas Jacobsen wrote:
> Try running `sacctmgr show runawayjobs`;  it should give you the list of
> running/pending jobs (from slurmdbd's perspective) that are unknown to
> slurmctld.

Thanks for this suggestion, it was the perfect solution.

No more "error: We have more allocated time than is possible" messages.
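
A rough illustration of why a runaway job can trigger that message. This is a minimal sketch, not slurmdbd's actual rollup code. It assumes that a job with no recorded end time is treated as still running, so its CPUs keep being counted in every rollup period. The function name, CPU counts and timestamps are made up for illustration.

```python
def allocated_cpu_seconds(jobs, period_start, period_end):
    """Sum the CPU-seconds the jobs occupy inside [period_start, period_end).

    Each job is a (cpus, start, end) tuple in epoch seconds; end=None
    means no end time was ever recorded, i.e. a runaway job.
    """
    total = 0
    for cpus, start, end in jobs:
        if end is None:
            end = period_end  # accounting believes the job is still running
        overlap = min(end, period_end) - max(start, period_start)
        if overlap > 0:
            total += cpus * overlap
    return total


# Hypothetical cluster with 8 CPUs and a one-hour rollup period.
CLUSTER_CPUS = 8
HOUR = 3600
capacity = CLUSTER_CPUS * HOUR

real_job = (8, 0, HOUR)          # genuinely fills the cluster for this hour
runaway_job = (4, -86400, None)  # finished long ago, end never recorded

allocated = allocated_cpu_seconds([real_job, runaway_job], 0, HOUR)
print(allocated, capacity, allocated > capacity)  # 43200 28800 True
```

With the runaway record present, the allocated total exceeds what the hypothetical cluster could deliver in that hour, which is the kind of inconsistency the error message complains about.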

>   It will give you the option to "fix" it, however note that
> fixing will set the end time of the job to the start time,

Better than nothing, or erratic daily sums.

>                                                            so the
> accounting will be defective, and it will re-roll (resummarize) accounting
> statistics back to that point in time.  If you fix a pending job, some
> versions of slurm set that re-roll time to 0 -- so it would re-roll all
> accounting activity.

This rerolling will take some time, I suppose? (I'll wait until Monday then
before rerunning the summing job.)
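
To make the trade-off of the "fix" concrete, here is a small sketch under the assumption, taken from Doug's description, that fixing sets the end time equal to the start time. The job then contributes zero elapsed time to accounting, however long it really ran. The record layout and numbers are hypothetical and do not reflect the real slurmdbd schema.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class JobRecord:
    # Hypothetical, simplified job record (not the real database layout).
    job_id: int
    cpus: int
    time_start: int
    time_end: Optional[int]  # None = no end recorded (runaway)


def fix_runaway(job: JobRecord) -> JobRecord:
    """Mimic the described "fix": set the end time to the start time."""
    return replace(job, time_end=job.time_start)


def cpu_seconds(job: JobRecord) -> int:
    """CPU-seconds charged to a job with a known end time."""
    if job.time_end is None:
        raise ValueError("job has no end time")
    return job.cpus * (job.time_end - job.time_start)


runaway = JobRecord(job_id=1234, cpus=16, time_start=1_578_000_000, time_end=None)
fixed = fix_runaway(runaway)
print(cpu_seconds(fixed))  # 0
```

The usage the job really consumed is lost, but the record no longer inflates every later rollup period.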

> In some cases we've chosen to manually edit the start/end times of these
> runaway jobs in the jobs_table of the database directly instead in order to
> maintain appropriate accounting, however that is fraught with risk as well
> (and unless well timed, it can make it challenging to re-roll the
> statistics well).

This paragraph, I admit, was too frightening - I have already lost accounting
data on another cluster (an HTCondor pool which did its history rotations too
quickly) :(

> These events often trace back to a crash of the slurmctld where some
> messages did not get received by the slurmdbd.

It seems that the slurmdbd indeed got restarted after those jobs had been
submitted (and the log file got zeroed) - although there's no indication of
a slurmctld crash corresponding to that day.

In any case, the situation apparently has been resolved - I've got to wait
for the daily rollup to fix the old accounting data though.

Thanks a lot!

- Steffen


-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~


