[slurm-users] TotalCPU: sacct reporting inexplicable high values

Christopher Benjamin Coffey Chris.Coffey at nau.edu
Fri Feb 1 16:06:52 UTC 2019


Nico, yep, that's a very annoying bug; we hit it here too when computing job efficiency. It was patched in 18.08.5. However, the db still needs to be cleaned up. We are working on a script to fix this. When we are done, we'll offer it up to the list.
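
In the meantime, a rough way to spot affected records is to check whether TotalCPU exceeds Elapsed x NCPUS, which should be its upper bound. The following is only a minimal Python sketch of that check, not the cleanup script mentioned above. It assumes sacct durations look like [DD-]HH:MM:SS or MM:SS.mmm; the function names and the 5% tolerance are made up for illustration.

```python
def parse_sacct_duration(text: str) -> float:
    """Convert a sacct duration ([DD-]HH:MM:SS or MM:SS.mmm) to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    fields = [float(f) for f in text.split(":")]
    if not 1 <= len(fields) <= 3:
        raise ValueError(f"unexpected duration: {text!r}")
    # Fields are right-aligned: the last is seconds, then minutes, then hours.
    seconds = sum(v * w for v, w in zip(reversed(fields), (1, 60, 3600)))
    return days * 86400 + seconds


def totalcpu_exceeds_bound(elapsed: str, ncpus: int, totalcpu: str,
                           tolerance: float = 1.05) -> bool:
    """Return True if TotalCPU is larger than Elapsed * NCPUS (plus slack).

    The tolerance is an arbitrary allowance for rounding, not a Slurm setting.
    """
    bound = parse_sacct_duration(elapsed) * ncpus
    return parse_sacct_duration(totalcpu) > bound * tolerance


# Values taken from the sacct output quoted below.
print(totalcpu_exceeds_bound("00:02:29", 4, "13:19:41"))   # job line: True
print(totalcpu_exceeds_bound("00:02:31", 4, "00:09.017"))  # batch step: False
```

Feeding it the fields from something like sacct --parsable2 --format=jobid,elapsed,ncpus,totalcpu would list the jobs whose accounting needs a second look.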

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 2/1/19, 8:47 AM, "slurm-users on behalf of nico.faerber at id.unibe.ch" <slurm-users-bounces at lists.schedmd.com on behalf of nico.faerber at id.unibe.ch> wrote:

    Hi
    
    
    While doing some statistics on efficient CPU usage, I realized that sacct is reporting inexplicably (at least to me) high values for TotalCPU, UserCPU and SystemCPU. Here is a simple example (each job step is an infinite while loop):
    
    
    sacct -j 64338003 --format=jobid,elapsed,ncpus,cputime,totalcpu,usercpu,systemcpu,nodelist
           JobID    Elapsed      NCPUS    CPUTime   TotalCPU    UserCPU  SystemCPU        NodeList
    ------------ ---------- ---------- ---------- ---------- ---------- ---------- ---------------
    64338003       00:02:29          4   00:09:56   13:19:41   13:19:36  00:05.054        anode033
    64338003.ba+   00:02:31          4   00:10:04  00:09.017  00:04.003  00:05.014        anode033
    64338003.ex+   00:02:30          4   00:10:00  00:00.001   00:00:00  00:00.001        anode033
    64338003.0     00:02:32          1   00:02:32   03:19:52   03:19:52  00:00.013        anode033
    64338003.1     00:02:32          1   00:02:32   03:19:54   03:19:54  00:00.008        anode033
    64338003.2     00:02:32          1   00:02:32   03:19:53   03:19:53  00:00.010        anode033
    64338003.3     00:02:32          1   00:02:32   03:19:52   03:19:52  00:00.007        anode033
    
    
    I would expect CPUTime to be the upper limit for TotalCPU.
    
    
    Looking at cpuacct.stat for job step 3:
    
    
    cat /cgroup/cpuacct/slurm/uid_6994/job_64338003/step_3/cpuacct.stat
    user 14902       (USER_HZ ticks; ~149 s = 00:02:29)
    system 0
    
    
    This value corresponds to the expected CPU usage of a single job step.
    
    
    We are running Slurm 18.08.4 with
    JobAcctGatherType=jobacct_gather/cgroup
    
    
    
    Does anyone have an explanation for those high values reported by sacct?
    
    
    
    
    
    Best,
    Nico
    
    
    Universitaet Bern, Abt. Informatikdienste
    
    Nico Färber
    High Performance Computing
    
    
    Gesellschaftsstrasse 6
    CH-3012 Bern
    Raum 104
    Tel. +41 (0)31 631 51 89
    
    
    
    
    
    
    


