[slurm-users] Determining Cluster Usage Rate
Juergen Salk
juergen.salk at uni-ulm.de
Sun May 16 11:28:45 UTC 2021
* Juergen Salk <juergen.salk at uni-ulm.de> [210515 23:54]:
> * Christopher Samuel <chris at csamuel.org> [210514 15:47]:
>
> > > Usage reported in Percentage of Total
> > > --------------------------------------------------------------------------------
> > >   Cluster      TRES Name    Allocated     Down PLND Dow        Idle Reserved     Reported
> > > --------- -------------- ------------ -------- -------- ----------- -------- ------------
> > >       oph            cpu       81.93%    0.00%    0.00%      15.85%    2.22%      100.00%
> > >       oph            mem       80.60%    0.00%    0.00%      19.40%    0.00%      100.00%
> >
> > The "Reserved" column is the one you're interested in, it's indicating that
> > for the 13th some jobs were waiting for CPUs, not memory.
>
>
> However, there is also "Overcommited" in the sreport man page, which
> looks promising from its description, although its exact definition
> is also not completely clear to me right away:
>
> --- snip ---
>
> Overcommited
>
> Time of eligible jobs waiting in the queue over the Reserved time.
> Unlike Reserved, this has no limit. It is typically useful to
> determine whether your system is overloaded and by how much.
>
> --- snip ---
And I just noticed that this description of "Overcommited" in the sreport(1)
man page was first introduced with versions 20.02.7 and 20.11.1, respectively.
In versions prior to 20.02.7 and 20.11.1 it still read:
--- snip ---
Overcommited
Time that the nodes were over allocated, either with the -O,
--overcommit flag at submission time or OverSubscribe set to FORCE
in the slurm.conf. This time is not counted against the total
reported time.
--- snip ---
So I assume the description of "Overcommited" in the sreport(1) man page was
simply wrong in older versions (unless its semantics have changed with
versions 20.02.7 and 20.11.1) ...
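
For reference, the figures quoted above are just the standard cluster
utilization report; something along these lines should reproduce them
(the date range and TRES list are only examples, and the exact option
spelling may differ slightly between Slurm versions):

  sreport cluster utilization start=2021-05-13 end=2021-05-14 -t percent --tres=cpu,mem

Comparing the "Reserved" column against "Idle" per TRES is then what shows
whether jobs were actually waiting for that particular resource.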
Best regards
Jürgen