[slurm-users] Account Usage Discrepancies
John Roberts
roberts.johneric at gmail.com
Wed Nov 15 09:55:35 MST 2017
Hi,
I'm having an issue with accounts in Slurm and I'm not sure if I'm missing
something. Here's a quick breakdown of the issue:
We have many accounts in Slurm (v16.05.10) / SlurmDBD. We recently set one
partition's billing weight to 0.25. The nodes in this partition have 64 cores
with 4 threads per core. We set the weight to 0.25 so we bill for core hours
only, not threads. This part seems to be working OK.
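For reference, the partition definition is roughly along these lines (the
partition name and node list below are placeholders, not our real config):

  PartitionName=parta Nodes=<nodelist> TRESBillingWeights="CPU=0.25"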
When querying the account balance via RawUsage (we use sbank to present this
to the user in readable hours), the numbers look right: they come out to a
quarter of the full node allocation, as expected.
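Concretely, the balance we look at is the RawUsage column that sshare reports
(sbank just wraps that in friendlier output); the account name here is a
placeholder:

  sshare -l -A <account>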
However, when querying, say, sreport's "UserUtilizationByAccount", the number
is about 4 times as much. That also makes sense, because the jobs are
technically allocated all cores and threads, but we only expect to bill for a
quarter of that CPU time.
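That figure comes from sreport, along the lines of the following (the account
name and dates are placeholders):

  sreport -t Hours cluster UserUtilizationByAccount Accounts=<foo> Start=2017-11-01 End=2017-11-15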
The problem arose when a user of this account tried to submit a job and it
sat in the queue with the pending reason "AssocGrpCPUMinutesLimit". Turning
up the debug logging showed this:
"debug2: Job 161868 being held, the job is at or exceeds assoc
2159(<foo>/(null)/(null)) group max tres(cpu) minutes of 150000000 of which
27718972 are still available but request is for 94371840 (plus 0 already in
use) tres minutes (request tres count 65536)"
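For what it's worth, the requested 94371840 tres minutes works out to the raw
CPU count times the time limit: 65536 CPUs x 1440 minutes (a 24-hour limit)
= 94,371,840 CPU-minutes, so the limit check appears to be counting raw CPUs
rather than billed CPUs.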
The available number above, 27718972, matches the max CPU minutes minus the
usage reported by "UserUtilizationByAccount", rather than the real balance of
about 4x that number that RawUsage would imply.
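The 150000000 is the GrpTRESMins cpu limit set on the account's association,
visible with something like this (account name is a placeholder):

  sacctmgr show assoc where account=<foo> format=Cluster,Account,User,GrpTRESMins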
Why would Slurm be enforcing the limit based on this number instead of
RawUsage? If we're billing at the lower rate, RawUsage should be the true
balance, but that doesn't seem to be the case.
thanks!
-John