<div dir="ltr"><div><div><div><div><div><div>Hi,<br><br></div>Sometimes when jobs are cancelled I see a spike in system load and hung task errors. It appears to be related to NFS and cgroups.<br><br></div>The slurmstepd process gets hung cleaning up cgroups:<br><br>INFO: task slurmstepd:11222 blocked for more than 120 seconds.<br>      Not tainted 4.4.0-119-generic #143-Ubuntu<br>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.<br>slurmstepd      D ffff8817b1d47808     0 11222      1 0x00000004<br> ffff8817b1d47808 0000000000000246 ffff880c48dd1c00 ffff881842de3800<br> ffff8817b1d48000 ffff880c4f9972c0 7fffffffffffffff ffffffff8184bd60<br> ffff8817b1d47960 ffff8817b1d47820 ffffffff8184b565 0000000000000000<br>Call Trace:<br> [<ffffffff8184bd60>] ? bit_wait+0x60/0x60<br> [<ffffffff8184b565>] schedule+0x35/0x80<br> [<ffffffff8184e6f6>] schedule_timeout+0x1b6/0x270<br> [<ffffffffc05fd130>] ? hash_ipport6_add+0x6c0/0x6c0 [ip_set_hash_ipport]<br> [<ffffffff810f8e2e>] ? ktime_get+0x3e/0xb0<br> [<ffffffff8184bd60>] ? bit_wait+0x60/0x60<br> [<ffffffff8184acd4>] io_schedule_timeout+0xa4/0x110<br> [<ffffffff8184bd7b>] bit_wait_io+0x1b/0x70<br> [<ffffffff8184b90f>] __wait_on_bit+0x5f/0x90<br> [<ffffffff8119296b>] wait_on_page_bit+0xcb/0xf0<br> [<ffffffff810c6b40>] ? autoremove_wake_function+0x40/0x40<br> [<ffffffff811a9abd>] shrink_page_list+0x78d/0x7a0<br> [<ffffffff811aa179>] shrink_inactive_list+0x209/0x520<br> [<ffffffff811aae03>] shrink_lruvec+0x583/0x740<br> [<ffffffff8109ab29>] ? 
__queue_work+0x139/0x3c0<br> [<ffffffff811ab0af>] shrink_zone+0xef/0x2e0<br> [<ffffffff811ab3fb>] do_try_to_free_pages+0x15b/0x3b0<br> [<ffffffff811ab89a>] try_to_free_mem_cgroup_pages+0xba/0x1a0<br> [<ffffffff812014b0>] mem_cgroup_force_empty_write+0x70/0xd0<br> [<ffffffff81115ea2>] cgroup_file_write+0x42/0x110<br> [<ffffffff81294210>] kernfs_fop_write+0x120/0x170<br> [<ffffffff81213cbb>] __vfs_write+0x1b/0x40<br> [<ffffffff81214679>] vfs_write+0xa9/0x1a0<br> [<ffffffff812135bf>] ? do_sys_open+0x1bf/0x2a0<br> [<ffffffff81215335>] SyS_write+0x55/0xc0<br> [<ffffffff8184f708>] entry_SYSCALL_64_fastpath+0x1c/0xbb<br><br></div>The job's own process always seems to be hung on NFS I/O:<br><br>INFO: task wb_command:11247 blocked for more than 120 seconds.<br>      Not tainted 4.4.0-119-generic #143-Ubuntu<br>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.<br>wb_command      D ffff880aa39fb9f8     0 11247      1 0x00000004<br> ffff880aa39fb9f8 ffff880c4fc90440 ffff880c48847000 ffff880c43f91c00<br> ffff880aa39fc000 ffff880c4fc972c0 7fffffffffffffff ffffffff8184bd60<br> ffff880aa39fbb58 ffff880aa39fba10 ffffffff8184b565 0000000000000000<br>Call Trace:<br> [<ffffffff8184bd60>] ? bit_wait+0x60/0x60<br> [<ffffffff8184b565>] schedule+0x35/0x80<br> [<ffffffff8184e6f6>] schedule_timeout+0x1b6/0x270<br> [<ffffffff810b9a8b>] ? dequeue_entity+0x41b/0xa80<br> [<ffffffff8184bd60>] ? bit_wait+0x60/0x60<br> [<ffffffff8184acd4>] io_schedule_timeout+0xa4/0x110<br> [<ffffffff8184bd7b>] bit_wait_io+0x1b/0x70<br> [<ffffffff8184b90f>] __wait_on_bit+0x5f/0x90<br> [<ffffffff8184bd60>] ? bit_wait+0x60/0x60<br> [<ffffffff8184b9c2>] out_of_line_wait_on_bit+0x82/0xb0<br> [<ffffffff810c6b40>] ? 
autoremove_wake_function+0x40/0x40<br> [<ffffffffc0649f57>] nfs_wait_on_request+0x37/0x40 [nfs]<br> [<ffffffffc064ee73>] nfs_writepage_setup+0x103/0x600 [nfs]<br> [<ffffffffc064f44a>] nfs_updatepage+0xda/0x380 [nfs]<br> [<ffffffffc063ef3d>] nfs_write_end+0x13d/0x4b0 [nfs]<br> [<ffffffff81413dbd>] ? iov_iter_copy_from_user_atomic+0x8d/0x220<br> [<ffffffff8119375b>] generic_perform_write+0x11b/0x1d0<br> [<ffffffff811954a2>] __generic_file_write_iter+0x1a2/0x1e0<br> [<ffffffff811955c5>] generic_file_write_iter+0xe5/0x1e0<br> [<ffffffffc063e5fa>] nfs_file_write+0x9a/0x170 [nfs]<br> [<ffffffff81213c55>] new_sync_write+0xa5/0xf0<br> [<ffffffff81213cc9>] __vfs_write+0x29/0x40<br> [<ffffffff81214679>] vfs_write+0xa9/0x1a0<br> [<ffffffff81215335>] SyS_write+0x55/0xc0<br> [<ffffffff8184f708>] entry_SYSCALL_64_fastpath+0x1c/0xbb<br><br></div>I upgraded Slurm from 17.02 to 17.11 fairly recently, but I am not sure whether this bug is new or simply went unnoticed before.<br><br></div>Thanks,<br></div>Brendan<br></div>