[slurm-users] bufferoverflow in slurmd with acct_gather_energy plugin
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Tue Aug 29 09:08:25 UTC 2023
Hi Magnus,
On 8/28/23 10:16, Hagdorn, Magnus Karl Moritz wrote:
> we recently enabled the energy gathering plugin on using the IPMI
> gatherer with libfreeipmi. We are running the latest slurm 23.02.4 on
> rocky 8.5. We are getting sporadic buffer overflows in slurmd when it
> is trying to query the IPMI interface. We have the feeling this occurs
> when a lot of jobs are getting started on the node. Has anybody come
> across this issue and even better found a solution?
I'm curious to learn about your energy gathering method: How do you
extract node power via IPMI with FreeIPMI (or some other toolset), and
how do you configure Slurm for this?
On our cluster I picked a Dell node and obtained this IPMI power reading
from the BMC using a FreeIPMI tool:
> $ ipmi-dcmi -D LAN_2_0 --username=root --password=<secret> --hostname=c190b --get-system-power-statistics
> Current Power : 151 Watts
> Minimum Power over sampling duration : 6 watts
> Maximum Power over sampling duration : 293 watts
> Average Power over sampling duration : 153 watts
> Time Stamp : 08/29/2023 - 08:54:03
> Statistics reporting time period : 1000 milliseconds
> Power Measurement : Active
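In case it is useful for comparing readings across nodes, here is a minimal
Python sketch that pulls the wattage figures out of output like the above.
The sample text is copied from the ipmi-dcmi output in this message; the
function name parse_power_statistics is my own illustration and not part of
FreeIPMI or Slurm, and I assume the "label : N Watts" layout stays the same.

```python
import re

# Sample text copied from the ipmi-dcmi output quoted in this message.
SAMPLE_DCMI_OUTPUT = """\
Current Power : 151 Watts
Minimum Power over sampling duration : 6 watts
Maximum Power over sampling duration : 293 watts
Average Power over sampling duration : 153 watts
Time Stamp : 08/29/2023 - 08:54:03
Statistics reporting time period : 1000 milliseconds
Power Measurement : Active
"""

# Matches lines of the form "<label> : <integer> watts" (case-insensitive).
WATTS_LINE = re.compile(r"\s*(.+?)\s*:\s*(\d+)\s+watts\s*$", re.IGNORECASE)


def parse_power_statistics(text):
    """Return {label: watts} for every line that reports a value in watts.

    Hypothetical helper: lines without a wattage (time stamp, reporting
    period, measurement state) are skipped.
    """
    readings = {}
    for line in text.splitlines():
        match = WATTS_LINE.match(line)
        if match:
            readings[match.group(1)] = int(match.group(2))
    return readings


dcmi_stats = parse_power_statistics(SAMPLE_DCMI_OUTPUT)
for label, watts in dcmi_stats.items():
    print(f"{label}: {watts} W")
```

Feeding the real command output to the function (for example via
subprocess) should work the same way, as long as the output format matches
the sample.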
However, the node's iDRAC BMC web GUI presents a somewhat different
reading, which I assume must be reliable: 168 W.
I'm also using Slurm with
AcctGatherEnergyType=acct_gather_energy/rapl, see [1]. With RAPL,
"scontrol show node c190" reports CurrentWatts=177, which measures only
CPU+DIMM power.
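To track the Slurm-side figure alongside the BMC reading, a small Python
sketch like the following could extract CurrentWatts from the scontrol text.
Only CurrentWatts=177 in the sample line comes from the reading above; the
rest of the line is a shortened placeholder, and the function name
current_watts is hypothetical.

```python
import re

# Placeholder excerpt of "scontrol show node" output. Only CurrentWatts=177
# is taken from the reading reported in this message.
SAMPLE_SCONTROL_OUTPUT = "NodeName=c190 CurrentWatts=177"


def current_watts(scontrol_output):
    """Return the CurrentWatts value as an int, or None if absent or non-numeric.

    Hypothetical helper, not part of Slurm.
    """
    match = re.search(r"\bCurrentWatts=(\d+)\b", scontrol_output)
    return int(match.group(1)) if match else None


rapl_watts = current_watts(SAMPLE_SCONTROL_OUTPUT)
print(f"Slurm CurrentWatts: {rapl_watts} W")
```

Returning None rather than raising keeps the helper usable on nodes where
the field is missing or does not hold a number.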
Thanks for sharing any insights.
Best regards,
Ole
[1]
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#power-monitoring-and-management