<div dir="ltr">I run a cluster we bought from ACT and recently updated to ClusterVisor v1.0.<div><br><div>The new version has (among many things) a really nice view of individual jobs' resource utilization (GPUs, memory, CPU, temperature, etc.). I did not pay attention to the overall statistics, so I am not sure how CV fares there: I care only about individual jobs, because I work with individual users, and overall utilization is information for upper management. At the moment only admins can see the info, but my understanding is that they are considering making it a user-facing feature, which will be really slick.</div><div><br><div>Several years ago I used XDMoD and SUPReMM. They were more confusing to use and had trouble collecting all the data we needed (which the team blamed on some BIOS settings), so the view was incomplete. Also, the tools seemed to be more focused on the overall stats than on per-job info (both were available, but the focus seemed to be on the former). I am sure these tools have improved since then, so I'm not dismissing them, just giving my opinion based on old facts. Comparing that old version of XDMoD to current CV (unfair, I know, but that's the comparison I've got), the latter wins hands down for per-job information. Also probably unfair is that XDMoD and SUPReMM are free and open source whereas CV is proprietary.</div></div></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jul 24, 2023 at 2:57 PM Magnus Jonsson <<a href="mailto:magnus.jonsson@umu.se">magnus.jonsson@umu.se</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="msg7890070509591973479">
<div lang="en-SE" style="overflow-wrap: break-word;">
<div class="m_7890070509591973479WordSection1">
<p class="MsoNormal"><span lang="EN-US" style="color:rgb(132,150,176)">We are feeding job usage information into a Prometheus database for our users (and us) to look at (via Grafana).<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:rgb(132,150,176)">It is also possible to get a list of jobs that are underusing memory, GPU or whatever metric you feed into the database.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:rgb(132,150,176)"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:rgb(132,150,176)">It’s a live feed with ~30s resolution from both the compute jobs and the Lustre file system.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:rgb(132,150,176)">It’s easy to extend with more metrics.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:rgb(132,150,176)"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:rgb(132,150,176)">If you want more details on what we are doing, just send me an email.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:rgb(132,150,176)"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US" style="color:rgb(132,150,176)">/Magnus<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="en-SE" style="color:rgb(132,150,176)"><u></u> <u></u></span></p>
<div>
<div>
<p class="MsoNormal"><span style="color:rgb(132,150,176)">-- <u></u>
<u></u></span></p>
</div>
<div>
<p class="MsoNormal"><span style="color:rgb(132,150,176)">Magnus Jonsson, Developer, HPC2N, Umeå Universitet<u></u><u></u></span></p>
</div>
</div>
<div>
<div style="border-right:none;border-bottom:none;border-left:none;border-top:1pt solid rgb(225,225,225);padding:3pt 0cm 0cm">
<p class="MsoNormal"><b><span lang="SV">From:</span></b><span lang="SV"> slurm-users <<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>>
<b>On behalf of </b>Will Furnell - STFC UKRI<br>
<b>Sent:</b> Monday, 24 July 2023 16:38<br>
<b>To:</b> <a href="mailto:slurm-users@schedmd.com" target="_blank">slurm-users@schedmd.com</a><br>
<b>Subject:</b> [slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)<u></u><u></u></span></p>
</div>
</div>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal"><span lang="EN-GB">Hello,<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-GB"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-GB">I am aware of ‘seff’, which allows you to check the efficiency of a single job, which is good for users, but as a cluster administrator I would like to be able to track the efficiency of all jobs from all users on the
cluster, so I am able to ‘re-educate’ users that may be running jobs that have terrible resource usage efficiency.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-GB"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-GB">What do other cluster administrators use for this task? Is there anything you use and recommend (or don’t recommend) or have heard of that is able to do this? Even something like a Grafana dashboard that hooks
up to the SLURM database would be useful.<u></u><u></u></span></p>
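<p class="MsoNormal"><span lang="EN-GB">To make the goal concrete, here is a minimal sketch of the kind of cluster-wide check meant above. It is not an existing tool: it assumes accounting data has been exported as pipe-separated text with the columns JobID, User, Elapsed, TotalCPU and AllocCPUS, and it approximates a seff-style CPU efficiency as TotalCPU / (Elapsed × AllocCPUS). The function names, the sample rows and the 50% threshold are all hypothetical.</span></p>

```python
# Sketch only: flags jobs whose CPU efficiency falls below a threshold.
# Assumed input: one job per line, "JobID|User|Elapsed|TotalCPU|AllocCPUS",
# e.g. as produced by something like
#   sacct --allusers --allocations --noheader --parsable2 \
#         --format=JobID,User,Elapsed,TotalCPU,AllocCPUS
# (whether TotalCPU is populated depends on the site's accounting setup).


def parse_slurm_time(text):
    """Convert a duration like '1-02:03:04', '02:03:04' or '03:04.567' to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [float(p) for p in text.split(":")]
    # Pad missing leading fields (hours, then minutes) with zeros.
    while len(parts) < 3:
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(elapsed, total_cpu, alloc_cpus):
    """CPU time actually used divided by the core-seconds that were allocated."""
    core_seconds = parse_slurm_time(elapsed) * alloc_cpus
    if core_seconds == 0:
        return 0.0
    return parse_slurm_time(total_cpu) / core_seconds


def inefficient_jobs(accounting_text, threshold=0.5):
    """Return (job_id, user, efficiency) for jobs below the threshold, worst first."""
    flagged = []
    for line in accounting_text.strip().splitlines():
        job_id, user, elapsed, total_cpu, alloc_cpus = line.split("|")
        eff = cpu_efficiency(elapsed, total_cpu, int(alloc_cpus))
        if eff < threshold:
            flagged.append((job_id, user, round(eff, 3)))
    return sorted(flagged, key=lambda row: row[2])


# Hypothetical sample rows, made up for illustration.
SAMPLE = """
1001|alice|01:00:00|03:30:00|4
1002|bob|02:00:00|00:16:00|8
1003|carol|1-00:00:00|06:00:00|1
"""

if __name__ == "__main__":
    for job_id, user, eff in inefficient_jobs(SAMPLE):
        print(f"{job_id} {user} {eff:.1%}")
```

<p class="MsoNormal"><span lang="EN-GB">The same ratio could be computed for memory (peak usage against requested memory) if those columns are exported as well.</span></p>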
<p class="MsoNormal"><span lang="EN-GB"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-GB">Thank you,<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-GB"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-GB">Will.<u></u><u></u></span></p>
</div>
</div>
</div></blockquote></div>