[slurm-users] Account Usage Discrepancies
Kevin M. Hildebrand
kevin at umd.edu
Tue Nov 28 06:26:47 MST 2017
Sounds suspiciously similar to a bug we reported a very long time ago,
and that I'd submitted a patch for:
https://bugs.schedmd.com/show_bug.cgi?id=1048
Which was then revisited here:
https://bugs.schedmd.com/show_bug.cgi?id=2423
Though my fix handles a problem with a UsageFactor other than 1, I'm
wondering if the problem is the same with BillingWeight too.
Kevin
On Mon, Nov 27, 2017 at 5:06 PM, John Roberts
<roberts.johneric at gmail.com> wrote:
> Hoping someone will get eyes on this one. I ended up changing the partition
> in question to only use 1 thread per core to keep things simple, but it
> would still be nice to know why Slurm is looking at TRES hours instead of
> RawUsage.
>
> thanks.
> -John
>
> On Wed, Nov 15, 2017 at 10:55 AM, John Roberts <roberts.johneric at gmail.com>
> wrote:
>>
>> Hi,
>>
>> I'm having an issue with accounts in Slurm and not sure if I'm missing
>> something. Here's a quick breakdown of the issue:
>>
>> We have many accounts in Slurm (v16.05.10) / SlurmDBD. We recently set one
>> partition's billing weight to 0.25. Each node in this partition has 64 cores
>> with 4 threads per core. We set the weight to 0.25 so we don't bill for
>> threads, just core hours. This part seems to be working ok.
>>
>> When querying the account balance via RawUsage (and we use sbank to
>> present this to the user in readable hours), these numbers look right. They
>> come out to a quarter of a full node.
>>
>> However, when querying, say, "UserUtilizationByAccount", this number is
>> about 4 times as much. This also makes sense because they are technically
>> being allocated for all cores and threads, but we only expect to bill for a
>> quarter of the time.
>>
>> The problem arose when a user of this account tried to submit a job and it
>> sat in the queue with the error "AssocGrpCPUMinutesLimit".
>>
>> Turning up the debug logs showed this:
>>
>> "debug2: Job 161868 being held, the job is at or exceeds assoc
>> 2159(<foo>/(null)/(null)) group max tres(cpu) minutes of 150000000 of which
>> 27718972 are still available but request is for 94371840 (plus 0 already in
>> use) tres minutes (request tres count 65536)"
>>
>> The available number above "27718972" matches what the balance would have
>> been from the max CPU minutes minus the usage from
>> "UserUtilizationByAccount" instead of reporting the real balance of 4x that
>> number.
>>
>> Why would Slurm be trying to schedule jobs based on this number instead of
>> RawUsage? If we're billing it lower, RawUsage should be the true balance,
>> but that doesn't seem to be the case.
>>
>> thanks!
>> -John
>
>
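
The figures in the quoted debug2 message can be checked with a little
arithmetic. The Python sketch below is only an illustration: the function
and variable names are hypothetical, it does not call any Slurm API, and the
weighted figures assume that all of the account's usage accrued on the
partition with the 0.25 billing weight, which the thread does not confirm.

```python
# Numbers copied from the quoted debug2 message.
GRP_CPU_MINS_LIMIT = 150_000_000  # "group max tres(cpu) minutes"
AVAILABLE_MINS = 27_718_972       # "are still available"
REQUESTED_MINS = 94_371_840       # "request is for"
REQUEST_TRES_COUNT = 65_536       # "request tres count"

# Billing weight from the original post. Applying it to every
# minute of usage is an assumption made for this sketch.
BILLING_WEIGHT = 0.25


def implied_usage(limit: int, available: int) -> int:
    """CPU minutes the limit check appears to have counted as used."""
    return limit - available


def implied_walltime_minutes(requested: int, tres_count: int) -> float:
    """Time limit implied by the requested TRES minutes and TRES count."""
    return requested / tres_count


def job_fits(limit: float, used: float, requested: float) -> bool:
    """True if the request fits under the remaining balance."""
    return used + requested <= limit


raw_used = implied_usage(GRP_CPU_MINS_LIMIT, AVAILABLE_MINS)
walltime = implied_walltime_minutes(REQUESTED_MINS, REQUEST_TRES_COUNT)

# Unweighted check: raw CPU minutes, as in the debug2 message.
fits_unweighted = job_fits(GRP_CPU_MINS_LIMIT, raw_used, REQUESTED_MINS)

# Hypothetical weighted check: usage and request both scaled by 0.25.
fits_weighted = job_fits(
    GRP_CPU_MINS_LIMIT,
    raw_used * BILLING_WEIGHT,
    REQUESTED_MINS * BILLING_WEIGHT,
)

print(f"implied raw usage:   {raw_used} CPU minutes")
print(f"implied time limit:  {walltime:.0f} minutes")
print(f"fits (unweighted):   {fits_unweighted}")
print(f"fits (weighted 0.25): {fits_weighted}")
```

Under these assumptions the request corresponds to 65536 CPUs for 1440
minutes (24 hours). It exceeds the balance when raw CPU minutes are counted,
which matches the hold reported in the log, but would fit if both the usage
and the request were scaled by the billing weight.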