[slurm-users] Hung tasks and high load when cancelling jobs
Brendan Moloney
moloney.brendan at gmail.com
Wed May 2 21:23:44 MDT 2018
Hi,
Sometimes when jobs are cancelled, I see a spike in system load and hung-task errors. It appears to be related to NFS and cgroups.
The slurmstepd process gets stuck cleaning up cgroups:
INFO: task slurmstepd:11222 blocked for more than 120 seconds.
Not tainted 4.4.0-119-generic #143-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
slurmstepd D ffff8817b1d47808 0 11222 1 0x00000004
ffff8817b1d47808 0000000000000246 ffff880c48dd1c00 ffff881842de3800
ffff8817b1d48000 ffff880c4f9972c0 7fffffffffffffff ffffffff8184bd60
ffff8817b1d47960 ffff8817b1d47820 ffffffff8184b565 0000000000000000
Call Trace:
[<ffffffff8184bd60>] ? bit_wait+0x60/0x60
[<ffffffff8184b565>] schedule+0x35/0x80
[<ffffffff8184e6f6>] schedule_timeout+0x1b6/0x270
[<ffffffffc05fd130>] ? hash_ipport6_add+0x6c0/0x6c0 [ip_set_hash_ipport]
[<ffffffff810f8e2e>] ? ktime_get+0x3e/0xb0
[<ffffffff8184bd60>] ? bit_wait+0x60/0x60
[<ffffffff8184acd4>] io_schedule_timeout+0xa4/0x110
[<ffffffff8184bd7b>] bit_wait_io+0x1b/0x70
[<ffffffff8184b90f>] __wait_on_bit+0x5f/0x90
[<ffffffff8119296b>] wait_on_page_bit+0xcb/0xf0
[<ffffffff810c6b40>] ? autoremove_wake_function+0x40/0x40
[<ffffffff811a9abd>] shrink_page_list+0x78d/0x7a0
[<ffffffff811aa179>] shrink_inactive_list+0x209/0x520
[<ffffffff811aae03>] shrink_lruvec+0x583/0x740
[<ffffffff8109ab29>] ? __queue_work+0x139/0x3c0
[<ffffffff811ab0af>] shrink_zone+0xef/0x2e0
[<ffffffff811ab3fb>] do_try_to_free_pages+0x15b/0x3b0
[<ffffffff811ab89a>] try_to_free_mem_cgroup_pages+0xba/0x1a0
[<ffffffff812014b0>] mem_cgroup_force_empty_write+0x70/0xd0
[<ffffffff81115ea2>] cgroup_file_write+0x42/0x110
[<ffffffff81294210>] kernfs_fop_write+0x120/0x170
[<ffffffff81213cbb>] __vfs_write+0x1b/0x40
[<ffffffff81214679>] vfs_write+0xa9/0x1a0
[<ffffffff812135bf>] ? do_sys_open+0x1bf/0x2a0
[<ffffffff81215335>] SyS_write+0x55/0xc0
[<ffffffff8184f708>] entry_SYSCALL_64_fastpath+0x1c/0xbb
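The mem_cgroup_force_empty_write frame suggests the stepd is blocked writing to the job cgroup's memory.force_empty file during teardown, which asks the kernel to reclaim every page charged to that cgroup; if some of those pages are dirty NFS pages, the write can't return until NFS writeback completes. As a rough sketch (not Slurm's actual code, and the cgroup path below is only a guess that depends on cgroup.conf and the cgroup mount layout), the operation in the trace is roughly equivalent to:

    # Hypothetical sketch of the cgroup teardown step the stack trace points at.
    # The path is an assumption; the real hierarchy depends on the cgroup mount
    # and Slurm's cgroup.conf.
    import os

    CGROUP = "/sys/fs/cgroup/memory/slurm/uid_1000/job_12345"  # hypothetical job cgroup

    def force_empty(cgroup_dir):
        # Writing to memory.force_empty triggers reclaim of every page charged
        # to this cgroup. Reclaim of dirty page-cache pages waits for writeback,
        # so this write can block behind slow or stuck NFS I/O
        # (mem_cgroup_force_empty_write -> try_to_free_mem_cgroup_pages above).
        with open(os.path.join(cgroup_dir, "memory.force_empty"), "w") as f:
            f.write("0\n")

    def remove(cgroup_dir):
        # The cgroup directory can only be removed once it has no tasks and
        # its charges have been drained.
        os.rmdir(cgroup_dir)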
The job's actual process, meanwhile, always seems to be hung on NFS I/O:
INFO: task wb_command:11247 blocked for more than 120 seconds.
Not tainted 4.4.0-119-generic #143-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
wb_command D ffff880aa39fb9f8 0 11247 1 0x00000004
ffff880aa39fb9f8 ffff880c4fc90440 ffff880c48847000 ffff880c43f91c00
ffff880aa39fc000 ffff880c4fc972c0 7fffffffffffffff ffffffff8184bd60
ffff880aa39fbb58 ffff880aa39fba10 ffffffff8184b565 0000000000000000
Call Trace:
[<ffffffff8184bd60>] ? bit_wait+0x60/0x60
[<ffffffff8184b565>] schedule+0x35/0x80
[<ffffffff8184e6f6>] schedule_timeout+0x1b6/0x270
[<ffffffff810b9a8b>] ? dequeue_entity+0x41b/0xa80
[<ffffffff8184bd60>] ? bit_wait+0x60/0x60
[<ffffffff8184acd4>] io_schedule_timeout+0xa4/0x110
[<ffffffff8184bd7b>] bit_wait_io+0x1b/0x70
[<ffffffff8184b90f>] __wait_on_bit+0x5f/0x90
[<ffffffff8184bd60>] ? bit_wait+0x60/0x60
[<ffffffff8184b9c2>] out_of_line_wait_on_bit+0x82/0xb0
[<ffffffff810c6b40>] ? autoremove_wake_function+0x40/0x40
[<ffffffffc0649f57>] nfs_wait_on_request+0x37/0x40 [nfs]
[<ffffffffc064ee73>] nfs_writepage_setup+0x103/0x600 [nfs]
[<ffffffffc064f44a>] nfs_updatepage+0xda/0x380 [nfs]
[<ffffffffc063ef3d>] nfs_write_end+0x13d/0x4b0 [nfs]
[<ffffffff81413dbd>] ? iov_iter_copy_from_user_atomic+0x8d/0x220
[<ffffffff8119375b>] generic_perform_write+0x11b/0x1d0
[<ffffffff811954a2>] __generic_file_write_iter+0x1a2/0x1e0
[<ffffffff811955c5>] generic_file_write_iter+0xe5/0x1e0
[<ffffffffc063e5fa>] nfs_file_write+0x9a/0x170 [nfs]
[<ffffffff81213c55>] new_sync_write+0xa5/0xf0
[<ffffffff81213cc9>] __vfs_write+0x29/0x40
[<ffffffff81214679>] vfs_write+0xa9/0x1a0
[<ffffffff81215335>] SyS_write+0x55/0xc0
[<ffffffff8184f708>] entry_SYSCALL_64_fastpath+0x1c/0xbb
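In case it helps anyone confirm the same pattern, here is a small sketch I could use (as root) to snapshot the kernel stacks of the hung tasks via /proc/<pid>/stack while the hang is in progress. Assumptions: the kernel exposes /proc/<pid>/stack (stock Ubuntu kernels do), and the process names are just the ones from my traces; adjust them to whatever your jobs run.

    # Minimal diagnostic sketch: dump kernel stacks of matching processes.
    # Run as root; names below are assumptions taken from the traces above.
    import os

    NAMES = {"slurmstepd", "wb_command"}

    def kernel_stacks(names):
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open("/proc/%s/comm" % pid) as f:
                    comm = f.read().strip()
                if comm in names:
                    with open("/proc/%s/stack" % pid) as f:
                        yield pid, comm, f.read()
            except OSError:
                continue  # process exited or access denied; skip it

    if __name__ == "__main__":
        for pid, comm, stack in kernel_stacks(NAMES):
            print("=== %s (pid %s) ===" % (comm, pid))
            print(stack)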
I upgraded fairly recently from 17.02 to 17.11, but I am not sure whether this bug is new or simply went unnoticed before.
Thanks,
Brendan