[slurm-users] seff: incorrect memory usage (18.08.5-2)
Christopher Benjamin Coffey
Chris.Coffey at nau.edu
Mon Mar 4 17:10:19 UTC 2019
You are welcome, Loris!
--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 2/26/19, 8:16 AM, "slurm-users on behalf of Loris Bennett" <slurm-users-bounces at lists.schedmd.com on behalf of loris.bennett at fu-berlin.de> wrote:
Hi Chris,
I had
JobAcctGatherType=jobacct_gather/linux
TaskPlugin=task/affinity
ProctrackType=proctrack/cgroup
ProctrackType was actually unset, but cgroup is the default.
I have now changed the settings to
JobAcctGatherType=jobacct_gather/cgroup
TaskPlugin=task/affinity,task/cgroup
ProctrackType=proctrack/cgroup
and added (in cgroup.conf)
TaskAffinity=no
ConstrainCores=yes
ConstrainRAMSpace=yes
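(For reference, the values the running system actually picked up can be double-checked with something like
$ scontrol show config | grep -E 'JobAcctGatherType|TaskPlugin|ProctrackType'
assuming the daemons have been restarted since the change.)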
For at least one job, this gives me the following while it is still running:
$ seff -d 4896
Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
Slurm data: 4896 loris sc RUNNING curta 8 2 2 2097152 0 0 33 3.6028797018964e+16 0
Job ID: 4896
Cluster: curta
User/Group: loris/sc
State: RUNNING
Nodes: 2
Cores per node: 4
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:04:24 core-walltime
Job Wall-clock time: 00:00:33
Memory Utilized: 32.00 EB (estimated maximum)
Memory Efficiency: 1717986918400.00% of 2.00 GB (256.00 MB/core)
WARNING: Efficiency statistics may be misleading for RUNNING jobs.
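(Incidentally, the bogus number is at least internally consistent: the raw Mem field above, 3.6028797018964e+16 KB, is exactly 2^55:
$ echo $((2**55))
36028797018963968
As KB that is 2^65 bytes = 32 EiB, hence the "32.00 EB", and dividing by the 2 GB requested (2097152 KB = 2^21 KB) gives 2^34 = 17179869184, i.e. exactly the 1717986918400.00% printed above. So the arithmetic in seff itself seems fine; it is the usage counter for the running job that is garbage.)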
and this at completion:
$ seff -d 4896
Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
Slurm data: 4896 loris sc COMPLETED curta 8 2 2 2097152 0 0 61 59400 0
Job ID: 4896
Cluster: curta
User/Group: loris/sc
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 4
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:08:08 core-walltime
Job Wall-clock time: 00:01:01
Memory Utilized: 58.01 MB (estimated maximum)
Memory Efficiency: 2.83% of 2.00 GB (256.00 MB/core)
which looks good. I'll see how it goes with a longer-running job.
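(The 58.01 MB also matches the raw Mem field: 59400 KB / 1024 = 58.01 MB. As a second opinion, assuming the job is in the accounting database, the peak RSS per step can be pulled directly with
$ sacct -j 4896 --format=JobID,MaxRSS,ReqMem
and compared with what seff reports.)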
Thanks for the input,
Loris
Christopher Benjamin Coffey <Chris.Coffey at nau.edu> writes:
> Hi Loris,
>
> Odd, we never saw the memory efficiency being out of whack, just the CPU efficiency. We are running 18.08.5-2, and here is a 512-core job run last night:
>
> Job ID: 18096693
> Array Job ID: 18096693_5
> Cluster: monsoon
> User/Group: abc123/cluster
> State: COMPLETED (exit code 0)
> Nodes: 60
> Cores per node: 8
> CPU Utilized: 01:34:06
> CPU Efficiency: 58.04% of 02:42:08 core-walltime
> Job Wall-clock time: 00:00:19
> Memory Utilized: 36.04 GB (estimated maximum)
> Memory Efficiency: 30.76% of 117.19 GB (1.95 GB/node)
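> (Sanity check: the core-walltime appears to be Ncpus x wall-clock, 512 cores x 19 s = 9728 s = 02:42:08, so that part of seff is behaving.)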
>
> Out of curiosity, which job accounting gather, task, and proctrack plugins are you using? We are using:
>
> JobAcctGatherType=jobacct_gather/cgroup
> TaskPlugin=task/cgroup,task/affinity
> ProctrackType=proctrack/cgroup
>
> Also cgroup.conf:
>
> ConstrainCores=yes
> ConstrainRAMSpace=yes
>
> Best,
> Chris
>
> --
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>
>
> On 2/26/19, 2:15 AM, "slurm-users on behalf of Loris Bennett" <slurm-users-bounces at lists.schedmd.com on behalf of loris.bennett at fu-berlin.de> wrote:
>
> Hi,
>
> With seff 18.08.5-2 we have been getting spurious results regarding
> memory usage:
>
> $ seff 1230_27
> Job ID: 1234
> Array Job ID: 1230_27
> Cluster: curta
> User/Group: xxxxxxxxx/xxxxxxxxx
> State: COMPLETED (exit code 0)
> Nodes: 4
> Cores per node: 25
> CPU Utilized: 9-16:49:18
> CPU Efficiency: 30.90% of 31-09:35:00 core-walltime
> Job Wall-clock time: 07:32:09
> Memory Utilized: 48.00 EB (estimated maximum)
> Memory Efficiency: 26388279066.62% of 195.31 GB (1.95 GB/core)
>
> It seems that the more cores are involved, the worse the overcalculation, but the growth is not linear.
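> (One consistency check: assuming seff derives the divisor as ReqMem-per-core times core count, 1.95 GB/core x 100 cores = 195.31 GB, which matches the figure above, so it is the "utilized" counter itself that is wildly wrong rather than the requested-memory side.)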
>
> Has anyone else seen this?
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin Email loris.bennett at fu-berlin.de
>
>
>
--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.bennett at fu-berlin.de