[slurm-users] seff: incorrect memory usage (18.08.5-2)
Loris Bennett
loris.bennett at fu-berlin.de
Tue Feb 26 15:12:29 UTC 2019
Hi Chris,
I had
JobAcctGatherType=jobacct_gather/linux
TaskPlugin=task/affinity
ProctrackType=proctrack/cgroup
ProctrackType was actually unset, but cgroup is the default, so that one was already in effect.
I have now changed the settings to
JobAcctGatherType=jobacct_gather/cgroup
TaskPlugin=task/affinity,task/cgroup
ProctrackType=proctrack/cgroup
and added the following to cgroup.conf:
TaskAffinity=no
ConstrainCores=yes
ConstrainRAMSpace=yes
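As an aside, plugin changes like these are generally not picked up by a plain 'scontrol reconfigure', so slurmctld and the slurmds have to be restarted. To confirm what the daemons are actually running, something like the following should do (standard scontrol; the grep pattern is just illustrative):

$ scontrol show config | grep -E 'JobAcctGatherType|ProctrackType|TaskPlugin'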
For at least one job, this gives me the following output while the job is running:
$ seff -d 4896
Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
Slurm data: 4896 loris sc RUNNING curta 8 2 2 2097152 0 0 33 3.6028797018964e+16 0
Job ID: 4896
Cluster: curta
User/Group: loris/sc
State: RUNNING
Nodes: 2
Cores per node: 4
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:04:24 core-walltime
Job Wall-clock time: 00:00:33
Memory Utilized: 32.00 EB (estimated maximum)
Memory Efficiency: 1717986918400.00% of 2.00 GB (256.00 MB/core)
WARNING: Efficiency statistics may be misleading for RUNNING jobs.
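Incidentally, the bogus memory figure is suspiciously round in binary: 3.6028797018964e+16 is 2^55, and 2^55 KB is 2^65 bytes, which (in the binary multiples seff uses) is exactly the 32.00 EB reported above. So presumably the Mem field for a running job comes back as an unset sentinel value rather than a measurement, and seff converts it anyway. A quick check of the arithmetic:

$ python3 -c 'print(2**55)'                     # 36028797018963968 = 3.6028797018964e+16
$ python3 -c 'print((2**55 * 2**10) // 2**60)'  # KB -> bytes -> EB (binary): 32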
and this at completion:
$ seff -d 4896
Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
Slurm data: 4896 loris sc COMPLETED curta 8 2 2 2097152 0 0 61 59400 0
Job ID: 4896
Cluster: curta
User/Group: loris/sc
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 4
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:08:08 core-walltime
Job Wall-clock time: 00:01:01
Memory Utilized: 58.01 MB (estimated maximum)
Memory Efficiency: 2.83% of 2.00 GB (256.00 MB/core)
which looks good. I'll see how it goes with longer-running jobs.
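For anyone who wants to eyeball the raw accounting numbers that seff works from, sacct on the same job is a useful cross-check (all standard sacct fields):

$ sacct -j 4896 --format=JobID,State,Elapsed,TotalCPU,ReqMem,MaxRSS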
Thanks for the input,
Loris
Christopher Benjamin Coffey <Chris.Coffey at nau.edu> writes:
> Hi Loris,
>
> Odd, we never saw that issue with the memory efficiency being out of whack, just the CPU efficiency. We are running 18.08.5-2, and here is a 512-core job run last night:
>
> Job ID: 18096693
> Array Job ID: 18096693_5
> Cluster: monsoon
> User/Group: abc123/cluster
> State: COMPLETED (exit code 0)
> Nodes: 60
> Cores per node: 8
> CPU Utilized: 01:34:06
> CPU Efficiency: 58.04% of 02:42:08 core-walltime
> Job Wall-clock time: 00:00:19
> Memory Utilized: 36.04 GB (estimated maximum)
> Memory Efficiency: 30.76% of 117.19 GB (1.95 GB/node)
>
> I'm curious: which job accounting gather, task, and proctrack plugins are you using? We are using:
>
> JobAcctGatherType=jobacct_gather/cgroup
> TaskPlugin=task/cgroup,task/affinity
> ProctrackType=proctrack/cgroup
>
> Also cgroup.conf:
>
> ConstrainCores=yes
> ConstrainRAMSpace=yes
>
> Best,
> Chris
>
> --
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>
>
> On 2/26/19, 2:15 AM, "slurm-users on behalf of Loris Bennett" <slurm-users-bounces at lists.schedmd.com on behalf of loris.bennett at fu-berlin.de> wrote:
>
> Hi,
>
> With seff 18.08.5-2 we have been getting spurious results regarding
> memory usage:
>
> $ seff 1230_27
> Job ID: 1234
> Array Job ID: 1230_27
> Cluster: curta
> User/Group: xxxxxxxxx/xxxxxxxxx
> State: COMPLETED (exit code 0)
> Nodes: 4
> Cores per node: 25
> CPU Utilized: 9-16:49:18
> CPU Efficiency: 30.90% of 31-09:35:00 core-walltime
> Job Wall-clock time: 07:32:09
> Memory Utilized: 48.00 EB (estimated maximum)
> Memory Efficiency: 26388279066.62% of 195.31 GB (1.95 GB/core)
>
> It seems that the more cores are involved, the worse the overcalculation is, but not linearly.
>
> Has anyone else seen this?
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin Email loris.bennett at fu-berlin.de
>
>
>
--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.bennett at fu-berlin.de