[slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu on Job Arrays

Christopher Benjamin Coffey Chris.Coffey at nau.edu
Thu Jan 10 10:13:16 MST 2019


We've attempted setting JobAcctGatherFrequency=task=0 and there is no change. We have settings:

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup

Odd ... wonder why we don't see it help.

Here is how we verify:

===
#!/bin/bash
#SBATCH --job-name=lazy                     # the name of your job
#SBATCH --output=/scratch/blah/lazy.txt    # this is the file your output and errors go to
#SBATCH --time=20:00                       # max run time
#SBATCH --workdir=/scratch/blah            # your work directory
#SBATCH --mem=7000                         # total mem (MB)
#SBATCH -c 4                               # 4 cpus

# use 500MB of memory and 1 cpu thread 
#srun stress -m 1 --vm-bytes 500M --timeout 65s

# use 500MB of memory and 3 cpu threads, 1 memory thread
srun stress -c 3 -m 1 --vm-bytes 500M --timeout 65s
===
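To cross-check what accounting actually recorded, we pull the per-step fields straight from sacct (job ID below is one of ours from the table further down; the exact field list is just what we find useful, not anything special):

```
sacct -j 7966 --format=JobID,JobName,ReqCPUS,Elapsed,UserCPU,SystemCPU,TotalCPU,MaxRSS
```

With `stress -c 3 -m 1 --timeout 65s` on 4 CPUs, TotalCPU should come out at roughly 4 x 65s, i.e. a bit over four CPU-minutes; anything in the CPU-hours range is clearly bogus.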

Still have jobs where usercpu is way too high — far more CPU time than 4 cores could possibly accrue in the elapsed wall time.

[cbc at head-dev ~ ]$ jobstats
JobID         JobName   ReqMem    MaxRSS    ReqCPUS   UserCPU     Timelimit   Elapsed    State       JobEff  
=============================================================================================================
7957          lazy      9.77G     0.0M      4         00:00:00    00:20:00    00:00:00   FAILED      -      
7958          lazy      6.84G     0.0M      4         00:00.018   00:20:00    00:00:01   FAILED      -      
7959          lazy      6.84G     480M      4         01:51.269   00:20:00    00:01:06   COMPLETED   18.17   
7960          lazy      6.84G     499M      4         02:01.275   00:20:00    00:01:06   COMPLETED   19.53   
7961          lazy      6.84G     499M      4         01:55.259   00:20:00    00:01:06   COMPLETED   18.76   
7962          lazy      6.84G     499M      4         01:58.307   00:20:00    00:01:06   COMPLETED   19.15   
7963          lazy      6.84G     491M      4         02:01.267   00:20:00    00:01:06   COMPLETED   19.49   
7964          lazy      6.84G     499M      4         02:01.270   00:20:00    00:01:05   COMPLETED   19.73   
7965          lazy      6.84G     500M      4         02:04.336   00:20:00    00:01:05   COMPLETED   20.13   
7966          lazy      6.84G     468M      4         04:58:56    00:20:00    00:01:05   COMPLETED   2303.53   
7967          lazy      6.84G     464M      4         04:40:39    00:20:00    00:01:05   COMPLETED   2162.87   
7968          lazy      6.84G     440M      4         05:20:22    00:20:00    00:01:05   COMPLETED   2468.26   
7969          lazy      6.84G     500M      4         05:14:37    00:20:00    00:01:05   COMPLETED   2424.32   
7970          lazy      6.84G     278M      4         02:56:39    00:20:00    00:01:06   COMPLETED   1341.42   
7971          lazy      6.84G     265M      4         02:57:18    00:20:00    00:01:06   COMPLETED   1346.28   
7972          lazy      6.84G     500M      4         02:54:38    00:20:00    00:01:06   COMPLETED   1327.2   
7973          lazy      6.84G     426M      4         02:29:50    00:20:00    00:01:06   COMPLETED   1138.96   
=============================================================================================================

Requested Memory: 06.49%
Requested Cores : 2906.81%
Time Limit      : 05.47%
========================
Efficiency Score: 972.92
========================
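A quick way to see why the last eight jobs above must be wrong: UserCPU can never legitimately exceed Elapsed x ReqCPUS. A small sanity-check sketch (the duration-parsing rules for sacct's "HH:MM:SS" and "MM:SS.mmm" formats are my assumption; `usercpu_plausible` is just a hypothetical helper, not anything shipped with Slurm):

```python
def to_seconds(t):
    """Parse sacct-style durations: 'D-HH:MM:SS', 'HH:MM:SS', or 'MM:SS.mmm'."""
    days = 0
    if "-" in t:
        d, t = t.split("-", 1)
        days = int(d)
    parts = t.split(":")
    secs = float(parts[-1])
    mins = int(parts[-2]) if len(parts) > 1 else 0
    hours = int(parts[-3]) if len(parts) > 2 else 0
    return days * 86400 + hours * 3600 + mins * 60 + secs

def usercpu_plausible(usercpu, elapsed, ncpus, slack=1.1):
    """True if UserCPU <= Elapsed * ReqCPUS (with 10% slack for rounding)."""
    return to_seconds(usercpu) <= to_seconds(elapsed) * ncpus * slack

# Job 7960: "02:01.275" is MM:SS.mmm, ~121s of CPU over 66s on 4 CPUs -> fine.
print(usercpu_plausible("02:01.275", "00:01:06", 4))   # True
# Job 7966: "04:58:56" is HH:MM:SS, ~5 CPU-hours in 65s on 4 CPUs -> impossible.
print(usercpu_plausible("04:58:56", "00:01:05", 4))    # False
```

By that check, jobs 7966-7973 are all reporting impossible UserCPU values, which matches the absurd JobEff numbers.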


—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 1/9/19, 7:24 AM, "slurm-users on behalf of Paddy Doyle" <slurm-users-bounces at lists.schedmd.com on behalf of paddy at tchpc.tcd.ie> wrote:

    On Wed, Jan 09, 2019 at 12:44:03PM +0100, Bjørn-Helge Mevik wrote:
    
    > Paddy Doyle <paddy at tchpc.tcd.ie> writes:
    > 
    > > Looking back through the mailing list, it seems that from 2015 onwards the
    > > recommendation from Danny was to use 'jobacct_gather/linux' instead of
    > > 'jobacct_gather/cgroup'. I didn't pick up on that properly, so we kept with
    > > the cgroup version.
    > >
    > > Is anyone else still using jobacct_gather/cgroup and are you seeing this
    > > same issue?
    > 
    > Just a side note: In last year's SLUG, Tim recommended the following
    > settings:
    > 
    > proctrack/cgroup, task/cgroup, jobacct_gather/cgroup
    > 
    > So the recommendation for jobacct_gather might have changed -- or Danny
    > and Tim might just have different opinions. :)
    
    Interesting... the cgroups documentation page still says the performance of
    jobacct_gather/cgroup is worse than jobacct_gather/linux. Although
    according to the git commits of doc/html/cgroups.shtml, that was added to
    the page in Jan 2015, so yeah maybe things have changed again. :)
    
    https://slurm.schedmd.com/cgroups.html
    
    In that case, either set 'JobAcctGatherFrequency=task=0' or wait for the
    bug to be fixed.
    
    Paddy
    
    -- 
    Paddy Doyle
    Trinity Centre for High Performance Computing,
    Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
    Phone: +353-1-896-3725
    http://www.tchpc.tcd.ie/
    
    


