[slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu on Job Arrays
Christopher Benjamin Coffey
Chris.Coffey at nau.edu
Thu Jan 10 10:13:16 MST 2019
We've attempted setting JobAcctGatherFrequency=task=0 and there is no change. We have the following settings:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup
Odd ... wonder why we don't see it help.
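For reference, this is roughly how those lines sit together in slurm.conf with the suggested sampling override added (a minimal sketch of our test setup; the exact file layout will differ per site):
===
# slurm.conf excerpt (illustrative)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup
# task=0 disables periodic in-job sampling; usage is only gathered when a task ends
JobAcctGatherFrequency=task=0
===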
Here is how we verify:
===
#!/bin/bash
#SBATCH --job-name=lazy # the name of your job
#SBATCH --output=/scratch/blah/lazy.txt # this is the file your output and errors go to
#SBATCH --time=20:00 # max time
#SBATCH --workdir=/scratch/blah # your work directory
#SBATCH --mem=7000 # total mem
#SBATCH -c4 # 4 cpus
# use 500MB of memory and 1 cpu thread
#srun stress -m 1 --vm-bytes 500M --timeout 65s
# use 500MB of memory and 3 cpu threads, 1 memory thread
srun stress -c 3 -m 1 --vm-bytes 500M --timeout 65s
===
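We just submit that script a number of times (the script name below is a placeholder) and look at the accounting after the jobs finish:
===
# submit several copies of the test job; "lazy.sh" is an illustrative name
for i in $(seq 1 8); do sbatch lazy.sh; done
===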
Still have jobs with usercpu way too high.
[cbc at head-dev ~ ]$ jobstats
JobID JobName ReqMem MaxRSS ReqCPUS UserCPU Timelimit Elapsed State JobEff
=============================================================================================================
7957 lazy 9.77G 0.0M 4 00:00:00 00:20:00 00:00:00 FAILED -
7958 lazy 6.84G 0.0M 4 00:00.018 00:20:00 00:00:01 FAILED -
7959 lazy 6.84G 480M 4 01:51.269 00:20:00 00:01:06 COMPLETED 18.17
7960 lazy 6.84G 499M 4 02:01.275 00:20:00 00:01:06 COMPLETED 19.53
7961 lazy 6.84G 499M 4 01:55.259 00:20:00 00:01:06 COMPLETED 18.76
7962 lazy 6.84G 499M 4 01:58.307 00:20:00 00:01:06 COMPLETED 19.15
7963 lazy 6.84G 491M 4 02:01.267 00:20:00 00:01:06 COMPLETED 19.49
7964 lazy 6.84G 499M 4 02:01.270 00:20:00 00:01:05 COMPLETED 19.73
7965 lazy 6.84G 500M 4 02:04.336 00:20:00 00:01:05 COMPLETED 20.13
7966 lazy 6.84G 468M 4 04:58:56 00:20:00 00:01:05 COMPLETED 2303.53
7967 lazy 6.84G 464M 4 04:40:39 00:20:00 00:01:05 COMPLETED 2162.87
7968 lazy 6.84G 440M 4 05:20:22 00:20:00 00:01:05 COMPLETED 2468.26
7969 lazy 6.84G 500M 4 05:14:37 00:20:00 00:01:05 COMPLETED 2424.32
7970 lazy 6.84G 278M 4 02:56:39 00:20:00 00:01:06 COMPLETED 1341.42
7971 lazy 6.84G 265M 4 02:57:18 00:20:00 00:01:06 COMPLETED 1346.28
7972 lazy 6.84G 500M 4 02:54:38 00:20:00 00:01:06 COMPLETED 1327.2
7973 lazy 6.84G 426M 4 02:29:50 00:20:00 00:01:06 COMPLETED 1138.96
=============================================================================================================
Requested Memory: 06.49%
Requested Cores : 2906.81%
Time Limit : 05.47%
========================
Efficiency Score: 972.92
========================
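jobstats is just a wrapper around sacct, so the inflated values can be checked directly. For example, comparing a sane job against an inflated one (job IDs taken from the table above, field names per sacct --helpformat):
===
# raw accounting for one normal job (7965) and one inflated job (7966)
sacct -j 7965,7966 \
  --format=JobID,JobName,ReqMem,MaxRSS,ReqCPUS,UserCPU,SystemCPU,TotalCPU,Elapsed,State
===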
--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 1/9/19, 7:24 AM, "slurm-users on behalf of Paddy Doyle" <slurm-users-bounces at lists.schedmd.com on behalf of paddy at tchpc.tcd.ie> wrote:
On Wed, Jan 09, 2019 at 12:44:03PM +0100, Bjørn-Helge Mevik wrote:
> Paddy Doyle <paddy at tchpc.tcd.ie> writes:
>
> > Looking back through the mailing list, it seems that from 2015 onwards the
> > recommendation from Danny was to use 'jobacct_gather/linux' instead of
> > 'jobacct_gather/cgroup'. I didn't pick up on that properly, so we kept with
> > the cgroup version.
> >
> > Is anyone else still using jobacct_gather/cgroup and are you seeing this
> > same issue?
>
> Just a side note: In last year's SLUG, Tim recommended the following
> settings:
>
> proctrack/cgroup, task/cgroup, jobacct_gather/cgroup
>
> So the recommendation for jobacct_gather might have changed -- or Danny
> and Tim might just have different opinions. :)
Interesting... the cgroups documentation page still says the performance of
jobacct_gather/cgroup is worse than jobacct_gather/linux. Although
according to the git commits of doc/html/cgroups.shtml, that was added to
the page in Jan 2015, so yeah maybe things have changed again. :)
https://slurm.schedmd.com/cgroups.html
In that case, either set 'JobAcctGatherFrequency=task=0' or wait for the
bug to be fixed.
Paddy
--
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/