<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
</head><body style="">
<div>
Dear Benson,
</div>
<div>
</div>
<div>
We have not set the "oversubscription" property in slurm.conf, so it is at its default, which is no.
</div>
<div>
Yes, sacct and the database are showing correct results.
</div>
<div>
When we look at the job details for that user in sacct, it does not show any job in the last 10 days.
</div>
<div>
Only sreport is showing incorrect utilisation: some users whom we have removed from their Slurm account are still reported with utilisation.
</div>
<div>
</div>
<div>
</div>
<div>
Below are the accountutilizationbyuser and utilisation reports generated by sreport on the cluster.
</div>
<div>
</div>
<div>
#sreport cluster accountutilizationbyuser -t per start=073022T11:00:00 end=now
</div>
<div>
<p>--------------------------------------------------------------------------------<br/>Cluster/Account/User Utilization 2022-07-30T11:00:00 - 2022-08-01T10:59:59 (172800 secs)<br/>Usage reported in Percentage of Total</p>
<pre>------------------------------------------------------------------
Cluster   Account         Login     Used       Energy
--------- --------------- --------- ---------- -------------------
cluster+  root                      842.73%    100.00%
cluster+  root            root      464.66%    0.00%
cluster+  hpc                       293.57%    0.00%
cluster+  hpc             user01    51.33%     0.00%
cluster+  hpc             user02    242.24%    0.00%
cluster+  phy                       73.79%     99.85%
cluster+  phy             user03    0.32%      0.00%</pre>
<p> </p>
</div>
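<div>
As a rough cross-check of the report above, here is a minimal Python sketch. It is not part of Slurm: the rows are copied by hand from the sreport output, and the helper names over_capacity and user_total are only illustrative. It lists every entry above 100% and adds up the per-user figures under one account so they can be compared with the account total.
</div>
<pre>
```python
# Rows copied by hand from the accountutilizationbyuser report above:
# (account, login or None for the account total line, used percent).
ROWS = [
    ("root", None, 842.73),
    ("root", "root", 464.66),
    ("hpc", None, 293.57),
    ("hpc", "user01", 51.33),
    ("hpc", "user02", 242.24),
    ("phy", None, 73.79),
    ("phy", "user03", 0.32),
]


def over_capacity(rows, limit=100.0):
    """Return the rows whose reported share exceeds `limit` percent."""
    return [row for row in rows if row[2] > limit]


def user_total(rows, account):
    """Sum the per-user percentages listed under one account."""
    return round(
        sum(p for a, login, p in rows if a == account and login is not None), 2
    )


# Entries that cannot be right, since no share of the cluster can exceed 100%.
for account, login, percent in over_capacity(ROWS):
    print(account, login or "(account total)", percent)

# Per-user figures under hpc, to compare with the hpc account line.
print("hpc users total:", user_total(ROWS, "hpc"))
```
</pre>
<div>
The per-user figures under hpc add up to the hpc account line, so the report is at least internally consistent; it is the underlying usage records that look inflated.
</div>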
<div>
<p>#sreport cluster utilisation -t per start=080122 end=now</p>
<pre>--------------------------------------------------------------------------------
Cluster Utilization 2022-08-01T00:00:00 - 2022-08-01T10:59:59
Usage reported in Percentage of Total
------------------------------------------------------------------------------------
Cluster   Allocated  Down     PLND Dow Idle     Reserved Reported
--------- ---------- -------- -------- -------- -------- ----------------------------
cluster+  100.00%    0.00%    0.00%    0.00%    0.00%    100.00%</pre>
</div>
<div>
</div>
<div>
Also, the cluster has been showing 1-2 runaway jobs every day, starting from the same period in which this issue first appeared. We remove them on a daily basis.
</div>
<div>
</div>
<div>
<div class="DTXlsb">
On 7/29/22 11:59, mshubham wrote:
<br/>> Dear All,
<br/>> I am facing an issue in SLURM(20.11.8), in which sreport cluster
<br/>> utilization is 100%, and when I run sreport cluster
<br/>> userutilizationbyaccount, Some user utilisation is greater than 100%,
<br/>> three users including root showing utilisation over 250%, making overall
<br/>> utilisation 500% (though user has not submitted any job in past one week)
<br/>> It was showing some runaway jobs, but we cleared it, then again, it was
<br/>> showing same runaway jobs, and we cleared it again. (both
<br/>> manually/through command)
</div> Is oversubscription enabled?
<br/>
<a target="_blank" href="https://slurm.schedmd.com/sreport.html#SECTION_REPORT-TYPES">https://slurm.schedmd.com/sreport.html#SECTION_REPORT-TYPES</a>
<br/>Do you get similar results with sacct?
<br/>
<div class="DTXlsb">
<br/>> Before that, we had encountered an issue in the past in which, in our
<br/>> cluster with primary and backup slurm controller, we kept a common
<br/>> mount point for the "StateSaveLocation" /var/share/slurm/ctld. Then we
<br/>> observed a strange behaviour that " If the mount point is present and
<br/>> the service is restarted on the primary controller then it replaces all
<br/>> the statesavelocation files."
<br/>>
<br/>> This resulted in cancellation of all the jobs (running, pending state),
<br/>> reservations and assigns the JobID from 1 for newly submitted jobs. If
<br/>> the StateSaveLocation is kept on local file system instead of shared
<br/>> mount point then everything works fine even after restarting the
<br/>> slurmctld service.
<br/>>
<br/>> After that issue, utilisation is higher than expected, though it has not
<br/>> impacted any real job utilisation.
<br/>>
<br/>> Also, we have removed those user's account in SLURM, yet it is still
<br/>> showing their utilisation
<br/>>
</div> The database should keep previous utilization records.
<br/>
<div class="wqmMgb">
<br/>> Please help in resolving this issue.
<br/>>
<br/>> Thanks and Regards,
<br/>> Shubham Mehta
<br/>> HPC Technology
<br/>> CDAC Pune
</div>
</div>
<div id="ox-signature">
Thanks and Regards,
<br/>Shubham Mehta
<br/>HPC Technology
<br/>CDAC Pune
</div>
</body></html>