The upgrade was a couple of hours prior to the messages appearing in the logs. 

SS

From: Ryan Novosielski <novosirj@rutgers.edu>
Sent: Thursday, September 19, 2024 12:08:42 AM
To: Sajesh Singh <ssingh@amnh.org>
Cc: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] SlurmDBD errors
 

I don’t think you should expect this from overlapping nodes in partitions, but rather when you’re allowing the hardware itself to be oversubscribed.

Was your upgrade in this window?

I would suggest looking for runaway jobs, which you’ve already done, and I’m not sure what else to check.
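
If it helps, the usual check is the stock sacctmgr subcommand (nothing site-specific here):

    # list jobs the database still thinks are running but the controller no longer knows about
    sacctmgr show runawayjobs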

--
#BlackLivesMatter
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novosirj@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'

On Sep 18, 2024, at 23:25, Sajesh Singh via slurm-users <slurm-users@lists.schedmd.com> wrote:

OS: CentOS 8.5
Slurm: 22.05
 
Recently upgraded to 22.05. The upgrade was successful, but after a while I started seeing the following messages in the slurmdbd.log file:
 
error: We have more time than is possible (9344745+7524000+0)(16868745) > 12362400 for cluster CLUSTERNAME(3434) from 2024-09-18T13:00:00 - 2024-09-18T14:00:00 tres 1 (this may happen if oversubscription of resources is allowed without Gang)
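 
(If I’m reading the numbers right, 12362400 is 3434 CPUs × 3600 seconds, i.e. the total CPU-seconds the cluster could provide in that one-hour window, and the rolled-up usage of 9344745+7524000 = 16868745 seconds exceeds it.)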
 
We do have partitions with overlapping nodes, but do not have “Suspend,Gang” set as the global PreemptMode. It is currently set to requeue.
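 
For reference, the relevant slurm.conf lines look roughly like this (a sketch only; the PreemptType value shown is an assumption, not necessarily what we run):

    # current global setting - preempted jobs are requeued, no gang scheduling
    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE

    # what the log message appears to expect if resources can be oversubscribed
    # PreemptMode=SUSPEND,GANG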
 
I have also checked sacct and there are no runaway jobs listed.
 
Oversubscription is not enabled on any of the queues either.
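 
That was checked with a plain scontrol query (standard invocation, nothing site-specific):

    # every partition reports OverSubscribe=NO
    scontrol show partition | grep -E 'PartitionName|OverSubscribe'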
 
Do I need to modify my Slurm config to address this, or is this an error condition caused by the upgrade?
 
Thank you,
 
SS
 
 
 

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com