[slurm-users] bufferoverflow in slurmd with acct_gather_energy plugin

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Tue Aug 29 09:08:25 UTC 2023


Hi Magnus,

On 8/28/23 10:16, Hagdorn, Magnus Karl Moritz wrote:
> we recently enabled the energy gathering plugin on using the IPMI
> gatherer with libfreeipmi. We are running the latest slurm 23.02.4 on
> rocky 8.5. We are getting sporadic buffer overflows in slurmd when it
> is trying to query the IPMI interface. We have the feeling this occurs
> when a lot of jobs are getting started on the node. Has anybody come
> across this issue and even better found a solution?

I'm curious to learn about your energy gathering method:  How do you 
extract node power via IPMI with FreeIPMI (or some other toolset), and 
how do you configure Slurm for this?

On one of the Dell nodes in our cluster I obtain this IPMI power reading 
from the BMC using the FreeIPMI tool ipmi-dcmi:

> $ ipmi-dcmi -D LAN_2_0 --username=root --password=<secret> --hostname=c190b --get-system-power-statistics
> Current Power                        : 151 Watts
> Minimum Power over sampling duration : 6 watts
> Maximum Power over sampling duration : 293 watts
> Average Power over sampling duration : 153 watts
> Time Stamp                           : 08/29/2023 - 08:54:03
> Statistics reporting time period     : 1000 milliseconds
> Power Measurement                    : Active
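
For comparing such readings across nodes, the output above can be turned 
into numbers.  The following is only a minimal Python sketch: the function 
name parse_power_statistics is hypothetical, the sample text is copied from 
the reading above, and I assume the field layout is the same on other BMCs.

```python
import re

# Sample text copied from the ipmi-dcmi reading quoted above.
SAMPLE = """\
Current Power                        : 151 Watts
Minimum Power over sampling duration : 6 watts
Maximum Power over sampling duration : 293 watts
Average Power over sampling duration : 153 watts
Time Stamp                           : 08/29/2023 - 08:54:03
Statistics reporting time period     : 1000 milliseconds
Power Measurement                    : Active
"""


def parse_power_statistics(text):
    """Return a dict mapping field names to integer watt readings.

    Hypothetical helper: lines whose value is not "<number> watts"
    (time stamp, reporting period, measurement state) are skipped.
    """
    readings = {}
    for line in text.splitlines():
        # Split on the first colon only; the time stamp value contains colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        # The tool prints both "Watts" and "watts", so match case-insensitively.
        match = re.fullmatch(r"\s*(\d+)\s+watts\s*", value, flags=re.IGNORECASE)
        if match:
            readings[key.strip()] = int(match.group(1))
    return readings


if __name__ == "__main__":
    for name, watts in parse_power_statistics(SAMPLE).items():
        print(f"{name}: {watts} W")
```

In practice the text would come from running ipmi-dcmi per node instead of 
the SAMPLE string.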

However, the node's iDRAC BMC web GUI presents a somewhat different 
reading, which I assume must be reliable:  168 W.

I'm also using Slurm with 
AcctGatherEnergyType=acct_gather_energy/rapl, see [1].  With RAPL, 
"scontrol show node c190" reports CurrentWatts=177, which measures only 
CPU+DIMM power.

Thanks for sharing any insights.

Best regards,
Ole

[1] https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#power-monitoring-and-management


