Hello, fellow users:
I have been using Slurm for the past three years, but recently I ran into something I do not understand.
I am using the metrics collected by Slurm (version 23.02.7, jobacct_gather/linux) to do a performance analysis of an application.
I have read the documentation on the metrics (https://slurm.schedmd.com/sacct.html) but still find the Ave* metrics confusing, specifically AveRSS and AveDiskWrite.
AveDiskWrite is defined as "Average number of bytes written by all tasks in job." So, if a given workload yields an AveDiskWrite of x and I double the workload, I should observe 2x. So far, that is what I have observed. But if I then double the resources while keeping the workload the same, I observe x again, and not 2x.
So my suspicion is that the metric is the total number of bytes written over the lifetime of the job, divided by the number of nodes.
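To make the suspicion concrete, here is a small Python sketch of the arithmetic I have in mind. It is only my guess and not how Slurm actually computes the value: the function name and the byte counts are made up for illustration.

```python
def hypothesized_ave_disk_write(total_bytes_written, num_nodes):
    """My guess at AveDiskWrite: total bytes written over the job,
    divided by the number of nodes. Hypothetical, not Slurm's code."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return total_bytes_written / num_nodes


# Made-up numbers: the baseline workload writes 100 bytes on 1 node.
x = hypothesized_ave_disk_write(100, 1)

# Doubling the workload on the same resources doubles the value.
doubled_workload = hypothesized_ave_disk_write(200, 1)

# Doubling the resources as well brings the value back to x.
doubled_workload_and_nodes = hypothesized_ave_disk_write(200, 2)

print(x, doubled_workload, doubled_workload_and_nodes)
```

If the divisor were the number of tasks rather than nodes, the sketch would look the same, so my observations alone cannot tell the two apart.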
With AveRSS, however, which is defined as "Average resident set size of all tasks in job," I observe the behavior I originally expected from AveDiskWrite. That is, the metric scales with the workload regardless of the resources available.
So I am not sure what "Ave" refers to here.
I would be thankful if someone could clarify the behavior, and even more grateful if someone could point me to where in the code these metrics are aggregated and processed before being stored in the database.
Many thanks, Manu.