<div style="color:black;font: 10pt arial;">Hello,
<div><br>
</div>
<div>For quite some time we have been trying to trace down an issue where Slurm slots remain consumed even though the job process has already completed on the given node. The environment is running Slurm 20.11.8.</div>
<div>Unfortunately, I cannot tie the issue to any specific configuration change.</div>
<div><br>
</div>
<div>It looks like the OS process finishes successfully, but Slurm's cleanup of the process, its cgroups, and so on never completes (this is a guess).</div>
<div>I was wondering if it relates to the cgroup settings (although those have been in place since the initial setup):</div>
<div>
<div># scontrol show config|grep -i cgroup</div>
<div>JobAcctGatherType = jobacct_gather/cgroup ############ might it conflict with ProctrackType?</div>
<div>ProctrackType = proctrack/cgroup</div>
<div>TaskPlugin = task/cgroup,task/affinity</div>
<div>Cgroup Support Configuration:</div>
<div>AllowedDevicesFile = /etc/slurm/cgroup_allowed_devices_file.conf</div>
<div>CgroupAutomount = yes</div>
<div>CgroupMountpoint = /sys/fs/cgroup</div>
</div>
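For reference, this is how I check whether the job's freezer cgroup (cgroup v1, as used by proctrack/cgroup) is still present on a node. The job ID and user name are the ones from the stuck job below; the path layout is my assumption for a default cgroup v1 mount under /sys/fs/cgroup:

```shell
#!/bin/sh
# Check whether Slurm's freezer cgroup for a job is still present.
# Path layout assumes a default cgroup v1 setup (CgroupMountpoint=/sys/fs/cgroup).
JOBID=22144064
JOBUSER_UID=$(id -u jobuser 2>/dev/null || echo 0)   # falls back to 0 if the user is unknown here
CG="/sys/fs/cgroup/freezer/slurm/uid_${JOBUSER_UID}/job_${JOBID}"
if [ -d "$CG" ]; then
    echo "freezer cgroup still present, remaining pids:"
    cat "$CG/cgroup.procs"
else
    echo "no freezer cgroup for job $JOBID"
fi
```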
<div><br>
</div>
<div><br>
</div>
<div>Details:</div>
<div>
<div># squeue |grep mynode</div>
<div> 22144064 mynode ABC.job jobuser R 23:50:05 1 mynode</div>
<div># ps aux|grep 22144064</div>
<div>root 61565 0.0 0.0 480064 11880 ? Sl Jun01 0:16 slurmstepd: [22144064.extern]</div>
<div>root 61834 0.0 0.0 490808 9628 ? Sl Jun01 0:04 slurmstepd: [22144064.batch]</div>
<div><br>
</div>
<div>^^^^^^^^ there is no slurm_script child process, nor any other user process, left for this job</div>
</div>
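To cross-check which pids slurmstepd itself still believes belong to the job, I also run scontrol listpids on the node (guarded here so the snippet is harmless on a machine without Slurm):

```shell
#!/bin/sh
# List the pids slurmstepd still associates with the job on this node.
JOBID=22144064
if command -v scontrol >/dev/null 2>&1; then
    scontrol listpids "$JOBID"
else
    echo "scontrol not available on this host"
fi
```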
<div><br>
</div>
<div>
<div># strace -ff -p 61834</div>
<div>strace: Process 61834 attached with 5 threads</div>
<div>[pid 61895] restart_syscall(<... resuming interrupted read ...> <unfinished ...></div>
<div>[pid 61838] restart_syscall(<... resuming interrupted poll ...> <unfinished ...></div>
<div>[pid 61837] restart_syscall(<... resuming interrupted futex ...> <unfinished ...></div>
<div>[pid 61836] futex(0x2b8cf914455c, FUTEX_WAIT_PRIVATE, 5767, NULL <unfinished ...></div>
<div>[pid 61834] futex(0x15f6f30, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...></div>
<div>[pid 61837] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)</div>
<div>[pid 61837] futex(0x2b8cf9142120, FUTEX_WAKE_PRIVATE, 1) = 0</div>
<div>[pid 61837] futex(0x2b8cf91420e4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 172939, {tv_sec=1654171864, tv_nsec=391983000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)</div>
<div>[pid 61837] futex(0x2b8cf9142120, FUTEX_WAKE_PRIVATE, 1) = 0</div>
<div>[pid 61837] futex(0x2b8cf91420e4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 172941, {tv_sec=1654171865, tv_nsec=391983000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)</div>
<div>[pid 61837] futex(0x2b8cf9142120, FUTEX_WAKE_PRIVATE, 1) = 0</div>
</div>
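Since strace only shows the threads parked in futex/poll waits, a full thread backtrace gives more context on where slurmstepd is stuck. I collect it with gdb like this (guarded, since gdb and the pid are only meaningful on the affected node; 61834 is the batch-step slurmstepd from the ps output above):

```shell
#!/bin/sh
# Dump backtraces of all threads of the stuck batch-step slurmstepd.
STEPD_PID=61834
if command -v gdb >/dev/null 2>&1 && grep -q slurmstepd "/proc/$STEPD_PID/comm" 2>/dev/null; then
    gdb -p "$STEPD_PID" -batch -ex 'thread apply all bt'
else
    echo "gdb or slurmstepd pid $STEPD_PID not available on this host"
fi
```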
<div><br>
</div>
<div>slurmd.log excerpt:</div>
<div>
<div> [22144064.extern] debug2: profile signaling type Task</div>
<div> [22144064.extern] debug2: jobacct_gather/cgroup: jag_common_poll_data: energycounted = 0</div>
<div> [22144064.batch] debug2: profile signaling type Task</div>
<div> [22144064.extern] debug2: jobacct_gather/cgroup: jag_common_poll_data: jag_common_poll_data: energy = 0 watts = 0 ave_watts = 0</div>
<div> [22144064.extern] debug: jobacct_gather/cgroup: jag_common_poll_data: jag_common_poll_data: Task 0 pid 61814 ave_freq = 1846 mem size/max 0/0 vmem size/max 114872320/114872320, disk read size/max (3676/3676), disk write size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0</div>
<div> [22144064.batch] debug2: profile signaling type Task</div>
<div> [22144064.extern] debug2: profile signaling type Task</div>
<div> [22144064.extern] debug2: jobacct_gather/cgroup: jag_common_poll_data: energycounted = 0</div>
<div> [22144064.extern] debug2: jobacct_gather/cgroup: jag_common_poll_data: jag_common_poll_data: energy = 0 watts = 0 ave_watts = 0</div>
<div> [22144064.extern] debug: jobacct_gather/cgroup: jag_common_poll_data: jag_common_poll_data: Task 0 pid 61814 ave_freq = 1846 mem size/max 0/0 vmem size/max 114872320/114872320, disk read size/max (3676/3676), disk write size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0</div>
<div> [22144064.extern] debug2: acct_gather_profile/influxdb: _send_data: acct_gather_profile/influxdb _send_data: data write success</div>
<div> [22144064.batch] debug2: profile signaling type Task</div>
<div> [22144064.extern] debug2: profile signaling type Task</div>
<div> [22144064.extern] debug2: jobacct_gather/cgroup: jag_common_poll_data: energycounted = 0</div>
<div> [22144064.extern] debug2: jobacct_gather/cgroup: jag_common_poll_data: jag_common_poll_data: energy = 0 watts = 0 ave_watts = 0</div>
<div> [22144064.extern] debug: jobacct_gather/cgroup: jag_common_poll_data: jag_common_poll_data: Task 0 pid 61814 ave_freq = 1846 mem size/max 0/0 vmem size/max 114872320/114872320, disk read size/max (3676/3676), disk write size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0</div>
<div>[22144064.batch] debug2: profile signaling type Task</div>
</div>
<div><br>
</div>
<div>Thanks & Best</div>
<div>Eg</div>
</div>