[slurm-users] slurm job left-overs (jobid.extern, jobid.batch) - slots consumed without a running process

Eg. Bo. egonle at aol.com
Thu Jun 2 17:58:24 UTC 2022


Hello,
for quite some time we have been trying to track down an issue where Slurm slots remain consumed although the "job process" has already completed on the given node. The environment is running Slurm 20.11.8. Unfortunately I cannot relate the issue to a specific configuration change.
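For context, this is roughly how we notice the stuck slots; node and job names below are from the example further down, and the squeue format string is just one way to show it:

# what the controller still thinks is allocated on the node
scontrol show node mynode | egrep -i 'State|CPUAlloc'
# jobs squeue still reports as running there
squeue -w mynode -o '%i %T %M %C'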
It looks like the OS process finishes successfully, but the Slurm cleanup of the process, cgroups and so on cannot be completed (guessing). I was wondering if it relates to the cgroup settings (although those have been in place since the initial setup):

# scontrol show config | grep -i cgroup
JobAcctGatherType       = jobacct_gather/cgroup   ############ might it conflict with ProctrackType?
ProctrackType           = proctrack/cgroup
TaskPlugin              = task/cgroup,task/affinity
Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
CgroupAutomount         = yes
CgroupMountpoint        = /sys/fs/cgroup
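In case it helps: this is how we peek at the job's cgroups on the node (assuming cgroup v1 with the default slurm hierarchy under the mount point above; the job id is from the example below and paths may differ on other setups):

# do the job's cgroup directories still exist?
find /sys/fs/cgroup/*/slurm/uid_*/job_22144064 -maxdepth 1 -type d 2>/dev/null
# which PIDs does the freezer cgroup of the batch step still track?
cat /sys/fs/cgroup/freezer/slurm/uid_*/job_22144064/step_batch/cgroup.procs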

Details:

# squeue | grep mynode
  22144064    mynode   ABC.job  jobuser  R   23:50:05      1 mynode

# ps aux | grep 22144064
root     61565  0.0  0.0 480064 11880 ?        Sl   Jun01   0:16 slurmstepd: [22144064.extern]
root     61834  0.0  0.0 490808  9628 ?        Sl   Jun01   0:04 slurmstepd: [22144064.batch]
^^^^^^^^ there is no slurm_script subprocess, or any other job process, left
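A cross-check we also do directly on the node (scontrol listpids queries the local slurmd/proctrack plugin, so it has to run on the compute node itself) to see which PIDs Slurm still associates with the job:

# scontrol listpids 22144064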
# strace -ff -p 61834
strace: Process 61834 attached with 5 threads
[pid 61895] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 61838] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 61837] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 61836] futex(0x2b8cf914455c, FUTEX_WAIT_PRIVATE, 5767, NULL <unfinished ...>
[pid 61834] futex(0x15f6f30, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 61837] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 61837] futex(0x2b8cf9142120, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 61837] futex(0x2b8cf91420e4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 172939, {tv_sec=1654171864, tv_nsec=391983000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 61837] futex(0x2b8cf9142120, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 61837] futex(0x2b8cf91420e4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 172941, {tv_sec=1654171865, tv_nsec=391983000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 61837] futex(0x2b8cf9142120, FUTEX_WAKE_PRIVATE, 1) = 0
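Since strace only shows the threads parked in futex/poll, a thread backtrace of the stuck stepd is probably more telling; something like this (needs gdb, plus the slurm debuginfo packages for readable symbols):

# gdb -p 61834 -batch -ex 'set pagination off' -ex 'thread apply all bt'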


slurmd.log:

[22144064.extern] debug2: profile signaling type Task
[22144064.extern] debug2: jobacct_gather/cgroup: jag_common_poll_data: energycounted = 0
[22144064.batch] debug2: profile signaling type Task
[22144064.extern] debug2: jobacct_gather/cgroup: jag_common_poll_data: jag_common_poll_data: energy = 0 watts = 0 ave_watts = 0
[22144064.extern] debug:  jobacct_gather/cgroup: jag_common_poll_data: jag_common_poll_data: Task 0 pid 61814 ave_freq = 1846 mem size/max 0/0 vmem size/max 114872320/114872320, disk read size/max (3676/3676), disk write size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
[22144064.batch] debug2: profile signaling type Task
[22144064.extern] debug2: profile signaling type Task
[22144064.extern] debug2: jobacct_gather/cgroup: jag_common_poll_data: energycounted = 0
[22144064.extern] debug2: jobacct_gather/cgroup: jag_common_poll_data: jag_common_poll_data: energy = 0 watts = 0 ave_watts = 0
[22144064.extern] debug:  jobacct_gather/cgroup: jag_common_poll_data: jag_common_poll_data: Task 0 pid 61814 ave_freq = 1846 mem size/max 0/0 vmem size/max 114872320/114872320, disk read size/max (3676/3676), disk write size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
[22144064.extern] debug2: acct_gather_profile/influxdb: _send_data: acct_gather_profile/influxdb _send_data: data write success
[22144064.batch] debug2: profile signaling type Task
[22144064.extern] debug2: profile signaling type Task
[22144064.extern] debug2: jobacct_gather/cgroup: jag_common_poll_data: energycounted = 0
[22144064.extern] debug2: jobacct_gather/cgroup: jag_common_poll_data: jag_common_poll_data: energy = 0 watts = 0 ave_watts = 0
[22144064.extern] debug:  jobacct_gather/cgroup: jag_common_poll_data: jag_common_poll_data: Task 0 pid 61814 ave_freq = 1846 mem size/max 0/0 vmem size/max 114872320/114872320, disk read size/max (3676/3676), disk write size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
[22144064.batch] debug2: profile signaling type Task
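For completeness, the manual cleanup path we know of from the scancel/scontrol man pages looks like the following; we would of course prefer to understand the root cause instead. Note that the node down/resume step kills any leftover stepds but affects the whole node:

# re-send KILL to every step of the job, including the batch and extern steps
scancel --full --signal=KILL 22144064
# last resort: take the node down (kills remaining stepds), then resume it
scontrol update nodename=mynode state=down reason=stuck_stepd
scontrol update nodename=mynode state=resume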



Thanks & Best
Eg