OS: CentOS 8.5
Slurm: 22.05
Recently upgraded to 22.05. Upgrade was successful, but after a while I started to see the following messages in the slurmdbd.log file:
error: We have more time than is possible (9344745+7524000+0)(16868745) > 12362400 for cluster CLUSTERNAME(3434) from 2024-09-18T13:00:00 - 2024-09-18T14:00:00 tres 1 (this may happen if oversubscription of resources is allowed without Gang)
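For what it's worth, the numbers in that message appear to be CPU-seconds for the one-hour rollup window. My reading of the format (an assumption, not documented behavior) is that the 3434 after the cluster name is the cluster's CPU count, so the "possible" time is 3434 CPUs × 3600 s, and the reported usage exceeds it:

```python
# Decoding the slurmdbd rollup error, assuming the format is
# (a+b+c)(total_reported) > possible for cluster NAME(cpu_count).
# The component labels below are guesses at what the three addends mean.
a = 9_344_745          # first usage component (CPU-seconds)
b = 7_524_000          # second usage component (CPU-seconds)
c = 0                  # third usage component (CPU-seconds)
reported = a + b + c

cpu_count = 3434       # assumed: figure shown after the cluster name
window_seconds = 3600  # 13:00:00 to 14:00:00 is one hour
possible = cpu_count * window_seconds

print(reported)             # 16868745, matching the log
print(possible)             # 12362400, matching the log
print(reported > possible)  # True, hence the error
```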
We do have partitions with overlapping nodes, but "Suspend,Gang" is not set as the global PreemptMode; it is currently set to requeue.
I have also checked sacct and there are no runaway jobs listed.
Oversubscription is not enabled on any of the queues either.
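One way to double-check both of those settings against the live cluster (a sketch; the grep patterns match the field names as scontrol prints them):

```shell
# Show the global preemption mode (expecting "requeue" per the above)
scontrol show config | grep -i PreemptMode

# Show per-partition oversubscription; OverSubscribe=NO means disabled
scontrol show partition | grep -iE 'PartitionName|OverSubscribe'
```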
Do I need to modify my Slurm config to address this, or is this an error condition caused by the upgrade?
Thank you,
SS
I don’t think you should expect this from overlapping nodes in partitions, but rather when you’re allowing the hardware itself to be oversubscribed.
Was your upgrade in this window?
I would suggest looking for runaway jobs, which you’ve already done; beyond that I’m not sure what else to check.
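For other readers hitting this: the runaway-jobs check mentioned above is done with sacctmgr, run against the live slurmdbd:

```shell
# List jobs that ended on the cluster but were never closed out in the
# accounting database; these can inflate rollup usage past what is possible.
sacctmgr show runawayjobs
# If any are listed, sacctmgr offers to fix them, after which slurmdbd
# re-rolls usage from the earliest affected time.
```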
--
#BlackLivesMatter
Ryan Novosielski - novosirj@rutgers.edu
Sr. Technologist - 973/972.0922 (2x0922) - RBHS Campus
Office of Advanced Research Computing - MSB A555B, Newark
Rutgers, the State University of NJ
The upgrade was a couple of hours prior to the messages appearing in the logs.
SS