[slurm-users] Determining Cluster Usage Rate
Juergen Salk
juergen.salk at uni-ulm.de
Sun May 16 11:28:45 UTC 2021
* Juergen Salk <juergen.salk at uni-ulm.de> [210515 23:54]:
> * Christopher Samuel <chris at csamuel.org> [210514 15:47]:
>
> > > Usage reported in Percentage of Total
> > > --------------------------------------------------------------------------------
> > >   Cluster      TRES Name    Allocated     Down PLND Dow        Idle Reserved     Reported
> > > --------- -------------- ------------ -------- -------- ----------- -------- ------------
> > >       oph            cpu       81.93%    0.00%    0.00%      15.85%    2.22%      100.00%
> > >       oph            mem       80.60%    0.00%    0.00%      19.40%    0.00%      100.00%
> >
> > The "Reserved" column is the one you're interested in, it's indicating that
> > for the 13th some jobs were waiting for CPUs, not memory.
>
>
> However, there is also "Overcommited" in the sreport man page, which
> looks promising from its description, although its exact definition
> is also not completely clear to me right away:
>
> --- snip ---
>
> Overcommited
>
> Time of eligible jobs waiting in the queue over the Reserved time.
> Unlike Reserved, this has no limit. It is typically useful to
> determine whether your system is overloaded and by how much.
>
> --- snip ---
And I just noticed that this description of "Overcommited" in the sreport(1)
man page was first introduced with versions 20.02.7 and 20.11.1, respectively.
In versions prior to 20.02.7 and 20.11.1 it still read:
--- snip ---
Overcommited
Time that the nodes were over allocated, either with the -O,
--overcommit flag at submission time or OverSubscribe set to FORCE
in the slurm.conf. This time is not counted against the total
reported time.
--- snip ---
So I assume the description of "Overcommited" in the sreport(1) man page was
simply wrong in older versions (unless its semantics have changed with
versions 20.02.7 and 20.11.1) ...
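
For reference, the figures quoted above are just the standard cluster
utilization report; something along these lines should reproduce them
(the date range and TRES list are only examples, and the exact option
spelling may differ slightly between Slurm versions):

  sreport cluster utilization start=2021-05-13 end=2021-05-14 -t percent --tres=cpu,mem

Comparing the "Reserved" column against "Idle" per TRES is then what shows
whether jobs were actually waiting for that particular resource.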
Best regards
Jürgen