[slurm-users] New Billing TRES Issue

Roberts, John E. jeroberts at anl.gov
Fri Apr 27 09:21:59 MDT 2018


I'm testing the newest version of Slurm and seeing an issue when using the newer billing TRES to charge for CPU time on a partition. As I understand it, billing should now be used instead of cpu in order for the partition's "TRESBillingWeights" option to take effect.

In my test case, I gave an account 2 hours of billing time. I used 1 hour of it with the partition set to TRESBillingWeights="CPU=1.0", and that seemed to bill properly.
Next, I set TRESBillingWeights="CPU=0.5" on the same partition. I ran several jobs, but the billing never seemed to increase. RawUsage, however, did increment correctly.
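
To make the expectation concrete, here is a minimal Python sketch of the arithmetic I assumed Slurm would apply: billing minutes as allocated CPUs times the partition's CPU weight times elapsed minutes. The function name and the formula are my own illustration, not taken from the Slurm source.

```python
def expected_billing_minutes(cpus, cpu_weight, elapsed_minutes):
    """Billing minutes I would expect for a CPU-only weight (assumed formula)."""
    return cpus * cpu_weight * elapsed_minutes

# One CPU for one hour under each weight I tested.
full = expected_billing_minutes(cpus=1, cpu_weight=1.0, elapsed_minutes=60)
half = expected_billing_minutes(cpus=1, cpu_weight=0.5, elapsed_minutes=60)
print(full, half)
```

Under that assumption, the 1-hour job at CPU=0.5 should have consumed 30 of the 120 billing minutes, not none.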

Here's an example of sshare reporting zero billing run minutes after I set CPU=0.5 and start a job with a walltime of 1 hour. Even though RawUsage is well past 2 hours, the job can still run when it shouldn't.

# sshare -A test -l -o RawUsage,GrpTRESMins,TRESRunMins%60
   RawUsage                    GrpTRESMins                                                  TRESRunMins 
----------- ------------------------------ ------------------------------------------------------------
      11068                    billing=120                      cpu=60,mem=0,energy=0,node=60,billing=0
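
For anyone who wants to check these fields in a script, the following is a small Python sketch that splits a TRES string like the one above into a dictionary. The helper name `parse_tres` is hypothetical, and it assumes every value is a plain integer, as in this output.

```python
def parse_tres(field):
    """Turn 'cpu=60,mem=0' into {'cpu': 60, 'mem': 0} (assumes integer values)."""
    result = {}
    for item in field.strip().split(","):
        name, _, value = item.partition("=")
        result[name] = int(value)
    return result

# The TRESRunMins column copied from the sshare output above.
run_mins = parse_tres("cpu=60,mem=0,energy=0,node=60,billing=0")
print(run_mins["cpu"], run_mins["billing"])
```

The point is that cpu shows 60 run minutes for the running job while billing shows 0.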

If I set CPU=1.0 and start, say, a 2-hour job, I get this in the logs:
debug2: Job 32 being held, the job is at or exceeds assoc 239(test/(null)/(null)) group max tres(billing) minutes of 120 of which 60 are still available but request is for 120 (plus 0 already in use) tres minutes (request tres count 1)
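
My reading of that message, written out as a Python sketch: the job is held when its requested billing minutes plus those already in use by running jobs exceed what remains under the group limit. The function and argument names are hypothetical, and the exact comparison Slurm performs is an assumption on my part.

```python
def would_hold(limit_minutes, used_minutes, in_use_minutes, request_minutes):
    """Return (held, available) under my assumed reading of the log message."""
    available = limit_minutes - used_minutes
    held = request_minutes + in_use_minutes > available
    return held, available

# Numbers from the debug2 line: limit 120, 60 already billed, 0 in use, request 120.
held, available = would_hold(limit_minutes=120, used_minutes=60,
                             in_use_minutes=0, request_minutes=120)
print(held, available)
```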

This makes sense because I previously ran a job at a weight of 1.0 for an hour, so it "billed" 1 hour at that time. How can I query the "available" billing hours if it's not RawUsage?

Going back to a CPU billing weight of 0.5, the logs seem to be inconsistent too. This first line shows the value I expect:
debug:  TRES Weight: cpu = 1.000000 * 0.500000 = 0.500000

but not a few lines down:
debug2: acct_policy_job_begin: after adding job 45, assoc 239(test/(null)/(null)) grp_used_tres_run_secs(billing) is 0
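
One hypothesis that would reconcile the two log lines, which I have not confirmed against the Slurm source: the weighted value is computed as a float but stored as an integer TRES count, so 0.5 becomes 0 for a 1-CPU job. The Python sketch below only illustrates that assumption; the function names are hypothetical.

```python
def billing_count(cpus, cpu_weight):
    weighted = cpus * cpu_weight   # 0.5 for 1 CPU, as in the "TRES Weight" line
    return int(weighted)           # ASSUMED integer storage; truncates toward zero

def billing_run_secs(cpus, cpu_weight, walltime_secs):
    """Billing run seconds that would be added when the job starts."""
    return billing_count(cpus, cpu_weight) * walltime_secs

print(billing_run_secs(1, 0.5, 3600))  # 1 CPU at weight 0.5
print(billing_run_secs(1, 1.0, 3600))  # 1 CPU at weight 1.0
print(billing_run_secs(2, 0.5, 3600))  # 2 CPUs at weight 0.5
```

If this is what happens, a 1-CPU job at CPU=0.5 would add nothing to the billing run seconds, which matches the 0 in the second log line.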

Again, RawUsage increases correctly, but Slurm appears to use some other field for billing when deciding whether a job can run.

My questions are: How can I set the CPU billing weight to less than 1 and still make sure jobs don't run once the account is out of time? What is Slurm using for billing, since it's clearly not RawUsage? Or am I simply misunderstanding the billing and/or weights fields?

Thanks for any help...
