[slurm-users] can't create memory group (cgroup)

Janne Blomqvist janne.blomqvist at aalto.fi
Mon Sep 10 00:42:00 MDT 2018

On 2018-09-07 18:53, Mike Cammilleri wrote:
> Hi everyone,
> I'm getting this error lately for everyone's jobs, which results in memory not being constrained via the cgroups plugin.
> slurmstepd: error: task/cgroup: unable to add task[pid=21681] to memory cg '(null)'
> slurmstepd: error: jobacct_gather/cgroup: unable to instanciate user 3691 memory cgroup
> The result is that no uid_ directories are created under /sys/fs/cgroup/memory
> Here is our cgroup.conf file:
> CgroupAutomount=yes
> CgroupReleaseAgentDir="/etc/cgroup"
> CgroupMountpoint=/sys/fs/cgroup
> ConstrainCores=yes
> ConstrainDevices=no
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> AllowedSwapSpace=0
> We are using jobacct_gather/cgroup
> JobAcctGatherType=jobacct_gather/cgroup
> The partition is configured like this
> PartitionName=long Nodes=marzano[05-13] PriorityTier=30 Default=NO MaxTime=5-0 State=UP OverSubscribe=FORCE:1
> We are using slurm 16.05.6 on Ubuntu 14.04 LTS
> Any ideas how to get cgroups going again?

This appears to be a bug in the Linux kernel: it does not garbage
collect deleted memory cgroups. Eventually the kernel hits an internal
limit on the number of memory cgroups and refuses to create more.

This bug has apparently been fixed in the upstream kernel, but is still 
present at least in the CentOS 7 kernel, and based on your report, in 
the Ubuntu 14.04 kernel.

One workaround is to reboot the node whenever this happens. Another is
to set ConstrainKmemSpace=no in cgroup.conf (but AFAICS this option was
added in slurm 17.02 and is not present in the 16.05 that you're using).
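
As a rough sketch of the second workaround, and assuming the nodes were
first upgraded to a slurm version that has the option (17.02 or later,
as far as I can tell), the cgroup.conf quoted above would gain a single
line. Everything else is copied unchanged from your file; I have not
tested this exact combination:

```
# cgroup.conf (sketch; assumes slurm 17.02 or later)
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/cgroup"
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedSwapSpace=0
# Added line: stop constraining kernel memory through the memory cgroup
ConstrainKmemSpace=no
```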

For more information, see discussion and links in slurm bug #5082.

Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqvist at aalto.fi
