<div dir="ltr">I use seff all the time as a first-order approximation. It gives a good hint of what's going on with a job but not much detail. <div><br></div><div>We are in the process of integrating the Supremm node utilization capture tool with our clusters and with our local XDMOD installation. Plain old XDMOD can ingest the Slurm logs and give you some great information on utilization, but it generally takes a high-level, summary view of the stats. To help users see their own job efficiency, you really need to give them time-series data, and we're expecting to get that from the Supremm components.</div><div><br></div><div>The other angle, which I've recently asked our eng/admin team to try to implement on our newest cluster (yet to be released), is to turn on Slurm's built-in job profiling. With this properly configured, users can enable profiling with a Slurm job option, and it will produce that time-series data. Look for the AcctGatherProfileType setting in slurm.conf.</div><div><br></div><div>Best,</div><div><br></div><div>Matt</div><div><br></div><div>Matthew Brown</div><div>Computational Scientist</div><div>Advanced Research Computing</div><div>Virginia Tech</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jul 24, 2023 at 10:39 AM Will Furnell - STFC UKRI <<a href="mailto:will.furnell@stfc.ac.uk">will.furnell@stfc.ac.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><div class="msg-7549622601740210210">
<div lang="EN-GB" style="overflow-wrap: break-word;">
<div class="m_-7549622601740210210WordSection1">
<p class="MsoNormal">Hello,<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">I am aware of ‘seff’, which allows you to check the efficiency of a single job, which is good for users, but as a cluster administrator I would like to be able to track the efficiency of all jobs from all users on the cluster, so I am able
to ‘re-educate’ users that may be running jobs that have terrible resource usage efficiency.<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">What do other cluster administrators use for this task? Is there anything you use and recommend (or don’t recommend) or have heard of that is able to do this? Even if it’s something like a Grafana dashboard that hooks up to the SLURM database.<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">Thank you,<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">Will.<u></u><u></u></p>
</div>
</div>
</div></blockquote></div>
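<div dir="ltr"><div>For the cluster-wide view asked about above, one rough option is to apply the same first-order approximation that seff gives for a single job to every job in the accounting data. The sketch below is only an illustration, not anything from our setup: it assumes pipe-separated rows such as those produced by something like <code>sacct -a -X -n -P -S 2023-07-01 --format=User,Elapsed,TotalCPU,AllocCPUS</code>, and the function names <code>to_seconds</code> and <code>cpu_efficiency_by_user</code> are made up for this example. It reports used CPU time divided by allocated CPU time per user and says nothing about memory or GPU usage.</div>
<pre>
```python
from collections import defaultdict


def to_seconds(text):
    """Convert a Slurm duration like '1-02:03:04', '02:03:04' or '03:04.500' to seconds."""
    if not text:
        return 0.0
    # An optional 'days-' prefix comes before the clock part.
    days, _, clock = text.rpartition("-")
    parts = [float(p) for p in clock.split(":")]
    # Pad missing leading fields (e.g. 'MM:SS') with zeros.
    while len(parts) != 3:
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def cpu_efficiency_by_user(sacct_text):
    """Return {user: used CPU seconds / allocated CPU seconds}.

    Assumes each row is 'User|Elapsed|TotalCPU|AllocCPUS' (an assumption
    of this sketch, matching the sacct format string in the lead-in).
    """
    used = defaultdict(float)
    allocated = defaultdict(float)
    for line in sacct_text.splitlines():
        if not line.strip():
            continue
        user, elapsed, total_cpu, alloc_cpus = line.split("|")
        used[user] += to_seconds(total_cpu)
        allocated[user] += to_seconds(elapsed) * int(alloc_cpus)
    # Skip users whose jobs had no allocated CPU time at all.
    return {u: used[u] / allocated[u] for u in allocated if allocated[u]}


if __name__ == "__main__":
    # Hypothetical sample rows, not real accounting data.
    sample = "alice|01:00:00|03:00:00|4\nbob|00:10:00|00:09:30|1\n"
    for user, eff in sorted(cpu_efficiency_by_user(sample).items()):
        print(f"{user}: {eff:.0%}")
```
</pre>
<div>Sorting the result by efficiency gives a quick list of users whose jobs mostly sit idle, but like seff it is only a summary; the time-series data from Supremm or Slurm job profiling is still what shows where in a job the time goes.</div></div>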