[slurm-users] slurm memory cgroup seems to have vanished
Andy Georges
Andy.Georges at UGent.be
Wed Apr 4 10:00:31 MDT 2018
Hi,
For some reason I am seeing memory cgroups disappear on the nodes:
[root at node3108 memory]# file $PWD/slurm
/sys/fs/cgroup/memory/slurm: cannot open (No such file or directory)
There is, however, a job running, and the cgroups for the other controllers are still present:
[root at node3108 memory]# ls /sys/fs/cgroup/cpu,cpuacct/slurm/
cgroup.clone_children cgroup.procs cpuacct.usage cpu.cfs_period_us cpu.rt_period_us cpu.shares notify_on_release uid_2540915 uid_2541917 uid_2541963
cgroup.event_control cpuacct.stat cpuacct.usage_percpu cpu.cfs_quota_us cpu.rt_runtime_us cpu.stat tasks uid_2540941 uid_2541948
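To rule out the whole memory hierarchy having been unmounted (as opposed to just the slurm directory having been removed), I intend to check something along these lines; the num_cgroups column in /proc/cgroups should show how many memory cgroups the kernel still knows about:

grep memory /proc/mounts
cat /proc/cgroups
ls /sys/fs/cgroup/memory/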
The config has:
[root at node3108 memory]# grep "CR_C" /etc/slurm/slurm.conf
SelectTypeParameters=CR_Core_Memory
[root at node3108 memory]# grep "cgr" /etc/slurm/slurm.conf
JobAcctGatherType=jobacct_gather/cgroup
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
[root at node3108 memory]# cat /etc/slurm/cgroup.conf
AllowedSwapSpace=10
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
TaskAffinity=yes
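One thing I started wondering about is whether empty cgroups are being removed behind slurmd's back, e.g. by a release agent. I have not verified this yet, but the root of the memory hierarchy should at least show whether one is configured:

cat /sys/fs/cgroup/memory/release_agent
cat /sys/fs/cgroup/memory/notify_on_release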
Slurmd logs show:
[2018-04-03T08:00:04.224] [4389.extern] Considering each NUMA node as a socket
[2018-04-03T08:00:04.240] [4389.extern] task/cgroup: /slurm/uid_2540941/job_4389: alloc=184300MB mem.limit=184300MB memsw.limit=202730MB
[2018-04-03T08:00:04.240] [4389.extern] task/cgroup: /slurm/uid_2540941/job_4389/step_extern: alloc=184300MB mem.limit=184300MB memsw.limit=202730MB
[2018-04-03T08:00:04.492] task_p_slurmd_batch_request: 4389
[2018-04-03T08:00:04.492] task/affinity: job 4389 CPU input mask for node: 0xFFFFFFFFF
[2018-04-03T08:00:04.492] task/affinity: job 4389 CPU final HW mask for node: 0xFFFFFFFFF
[2018-04-03T08:00:05.534] Launching batch job 4389 for UID 2540941
[2018-04-03T08:00:05.587] [4389.batch] Considering each NUMA node as a socket
[2018-04-03T08:00:05.597] [4389.batch] task/cgroup: /slurm/uid_2540941/job_4389: alloc=184300MB mem.limit=184300MB memsw.limit=202730MB
[2018-04-03T08:00:05.598] [4389.batch] task/cgroup: /slurm/uid_2540941/job_4389/step_batch: alloc=184300MB mem.limit=184300MB memsw.limit=202730MB
[2018-04-03T08:00:05.668] [4389.batch] task_p_pre_launch: Using sched_affinity for tasks
[2018-04-03T08:00:08.594] launch task 4389.0 request from 2540941.2540941 at 10.141.4.9 (port 5299)
[2018-04-03T08:00:08.594] lllp_distribution jobid [4389] auto binding off: mask_cpu,one_thread
[2018-04-03T08:00:08.645] [4389.0] Considering each NUMA node as a socket
[2018-04-03T08:00:08.654] [4389.0] task/cgroup: /slurm/uid_2540941/job_4389: alloc=184300MB mem.limit=184300MB memsw.limit=202730MB
[2018-04-03T08:00:08.655] [4389.0] task/cgroup: /slurm/uid_2540941/job_4389/step_0: alloc=184300MB mem.limit=184300MB memsw.limit=202730MB
[2018-04-03T08:00:08.669] [4389.0] task_p_pre_launch: Using sched_affinity for tasks
[2018-04-04T17:45:51.336] [4389.0] _oom_event_monitor: oom-kill event count: 1
[2018-04-04T17:45:51.524] [4389.batch] _oom_event_monitor: oom-kill event count: 1
[2018-04-04T17:45:51.691] [4389.extern] _oom_event_monitor: oom-kill event count: 1
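Since the oom-kill events show up in the step logs, I also plan to cross-check the kernel log on the node for the corresponding OOM killer records and for any cgroup-related messages, roughly like this (assuming the messages are still in the ring buffer or in syslog):

dmesg -T | grep -i -E 'oom|killed process'
grep -i -E 'oom|cgroup' /var/log/messages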
Currently, these processes are running as the job on the node:
root 425230 0.0 0.0 308260 4616 ? Sl Apr03 0:09 slurmstepd: [4389.extern]
root 425235 0.0 0.0 107904 604 ? S Apr03 0:00 \_ sleep 1000000
root 425329 0.0 0.0 308592 4940 ? Sl Apr03 0:43 slurmstepd: [4389.batch]
vsc40941 425334 0.0 0.0 113280 1660 ? S Apr03 0:00 \_ /bin/bash /var/spool/slurm/slurmd/job04389/slurm_script
vsc40941 425584 0.0 0.0 223660 14920 ? S Apr03 0:04 \_ /usr/bin/python /apps/gent/CO7/skylake-ib-PILOT/software/vsc-mympirun/4.1.0/bin/mympirun --hybrid 8 --output /user/scratch/gent/gvo000/gvo00003/vsc40941/pilot_testi
vsc40941 425602 0.0 0.0 113284 1548 ? S Apr03 0:00 \_ /bin/sh /apps/gent/CO7/skylake-ib-PILOT/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28/bin64/mpirun --file=/user/home/gent/vsc409/vsc40941/.mympiru
vsc40941 425607 0.0 0.0 15916 1640 ? S Apr03 0:00 \_ mpiexec.hydra --file=/user/home/gent/vsc409/vsc40941/.mympirun_7xwr8q/4389_20180403_080008/mpdboot --machinefile /user/home/gent/vsc409/vsc40941/.mympirun_7
vsc40941 425608 0.0 0.0 252972 4800 ? Sl Apr03 0:00 \_ /bin/srun --nodelist node3108.skitty.os -N 1 -n 1 --input none /apps/gent/CO7/skylake-ib-PILOT/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.
vsc40941 425609 0.0 0.0 48204 712 ? S Apr03 0:00 \_ /bin/srun --nodelist node3108.skitty.os -N 1 -n 1 --input none /apps/gent/CO7/skylake-ib-PILOT/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.
root 425617 0.0 0.0 376888 4676 ? Sl Apr03 1:06 slurmstepd: [4389.0]
vsc40941 425624 0.0 0.0 19340 1928 ? S Apr03 0:00 \_ /apps/gent/CO7/skylake-ib-PILOT/software/impi/2018.1.163-iccifort-2018.1.163-GCC-6.4.0-2.28/bin64/pmi_proxy --control-port node3108.skitty.os:44097 --pmi-connect alltoa
vsc40941 425628 99.7 0.6 1754940 1305896 ? Rl Apr03 2021:06 \_ vasp
vsc40941 425629 99.7 0.6 1770904 1320548 ? Rl Apr03 2021:01 \_ vasp
vsc40941 425630 99.7 0.6 1768488 1325680 ? Rl Apr03 2020:47 \_ vasp
vsc40941 425631 99.7 0.6 1745188 1314856 ? Rl Apr03 2021:28 \_ vasp
vsc40941 425632 99.7 0.6 1786948 1346932 ? Rl Apr03 2021:35 \_ vasp
vsc40941 425633 99.7 0.6 1755904 1318080 ? Rl Apr03 2020:47 \_ vasp
vsc40941 425634 99.7 0.6 1740088 1300324 ? Rl Apr03 2021:36 \_ vasp
vsc40941 425635 99.7 0.6 1751500 1312788 ? Rl Apr03 2021:34 \_ vasp
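To see which memory cgroup those processes are currently attached to, I would expect something like the following to help (PIDs taken from the ps output above; I have not captured this yet):

for pid in 425334 425624 425628; do
    echo "== $pid =="
    grep memory /proc/$pid/cgroup
done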
Furthermore, I see the following open files:
[root at node3108 memory]# lsof -p 425624
<snip>
pmi_proxy 425624 vsc40941 10r REG 0,24 0 12195624 /sys/fs/cgroup/memory/slurm/uid_2540941/job_4389/step_0/memory.oom_control (deleted)
pmi_proxy 425624 vsc40941 11w REG 0,24 0 12195612 /sys/fs/cgroup/memory/slurm/uid_2540941/job_4389/step_0/cgroup.event_control (deleted)
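If I interpret the "(deleted)" markers correctly, the step's memory cgroup directory was removed from underneath the still-open event-monitor file descriptors. The slurmstepd processes should hold the same kind of descriptors, so I would expect (though I have not checked) that the following shows the same picture for them:

ls -l /proc/425230/fd /proc/425329/fd /proc/425617/fd | grep -i cgroup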
Is there anything you can point me to that would help me understand why the cgroup has gone? We've seen this with other (similar) jobs, which were killed by the OOM killer for exceeding the cgroup memory limit (even though the user claims no more than 1 GB should be used per node). I am not sure that is related, but we'd prefer to keep the memory cgroups around. In those earlier cases, once the nodes were empty, a new job recreated the memory cgroup hierarchy.
Given that the following line appears in the logs:
[2018-04-03T08:00:05.597] [4389.batch] task/cgroup: /slurm/uid_2540941/job_4389: alloc=184300MB mem.limit=184300MB memsw.limit=202730MB
I am assuming that the memory cgroup for the job was still present when that line was logged; is that correct?
Thanks in advance,
— Andy