[slurm-users] sreport outputs invalid values due to corrupted data

Jean-Christophe HAESSIG haessigj at igbmc.fr
Wed Mar 9 13:19:12 UTC 2022


I recently noticed impossible usage values returned by sreport: my 
cluster was reportedly used at 100%.

Upon further investigation, I found about 6000 jobs launched on 
2020-08-31 that were 'COMPLETED' but whose CPUTime was still increasing, 
amounting to about 500 days. The root cause of this seems to be a 
failure of compute nodes that were decommissioned afterwards.
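
For illustration, here is a minimal sketch of why such a job could keep accruing CPU time. It assumes (this is not confirmed here) that an end time of 0 is read as "no end recorded yet", so the elapsed time is measured up to the current time. The function name and parameters are hypothetical, not part of Slurm.

```python
import time


def cpu_time_seconds(time_start, time_end, cpus, now=None):
    """Return elapsed wall time multiplied by the number of CPUs.

    Assumption: time_end == 0 means "no end recorded", so the
    current time is used in its place and the result keeps growing.
    """
    if now is None:
        now = int(time.time())
    end = time_end if time_end != 0 else now
    return max(end - time_start, 0) * cpus
```

Under that assumption, a job with a recorded end reports a fixed value, while one with time_end = 0 reports a larger value each time it is queried.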

To troubleshoot, I connected to the accounting database and found that 
the time_end column of the cluster_job_table table was 0 for these jobs. 
I replaced it with a meaningful value, which fixed things for sacct but 
only affects sreport queries for recent dates.
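
The lookup and fix described above could be sketched as follows. This is a stand-in only: it uses an in-memory SQLite database rather than the real accounting database, and the table name `mycluster_job_table`, the reduced column set, and the numeric state code 3 for COMPLETED are assumptions made for the example.

```python
import sqlite3

TABLE = "mycluster_job_table"  # hypothetical: "<cluster name>_job_table"
JOB_COMPLETE = 3               # assumed numeric code for the COMPLETED state


def make_demo_db():
    """Build a tiny stand-in for the job table with illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {TABLE} "
        "(id_job INTEGER, state INTEGER, time_start INTEGER, time_end INTEGER)"
    )
    conn.executemany(
        f"INSERT INTO {TABLE} VALUES (?, ?, ?, ?)",
        [
            (1, JOB_COMPLETE, 1598860800, 1598864400),  # healthy job
            (2, JOB_COMPLETE, 1598860800, 0),           # end time missing
            (3, JOB_COMPLETE, 1598861000, 0),           # end time missing
        ],
    )
    return conn


def broken_jobs(conn):
    """List completed jobs whose time_end is still 0."""
    rows = conn.execute(
        f"SELECT id_job FROM {TABLE} "
        "WHERE state = ? AND time_end = 0 ORDER BY id_job",
        (JOB_COMPLETE,),
    )
    return [row[0] for row in rows]


def close_jobs(conn, end_time):
    """Give those jobs an end time; return the number of rows changed."""
    cur = conn.execute(
        f"UPDATE {TABLE} SET time_end = ? WHERE state = ? AND time_end = 0",
        (end_time, JOB_COMPLETE),
    )
    conn.commit()
    return cur.rowcount
```

Calling `broken_jobs(conn)` before and after `close_jobs(conn, end_time)` shows which rows were affected. The sketch only touches the job table; it does not recompute any usage tables.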

It seems that sreport takes its data from the *_assoc_usage_*_table 
tables, and I do not know how they relate to the jobs table. Is there a 
way to fix the data?

J.C. Haessig
