waitpid is O(# children), resulting in O((# sim processes)^2) in thread_ptrace #1134
Description
Broken out from #1041
Using a syscall-heavy microbenchmark, I did some perf-based profiling of ptrace mode.
Without cpu-pinning, we spend 20% of our time in wait_consider_task inside the kernel. It turns out that waitpid is implemented via the wait4 syscall, which lands in the kernel's do_wait, which checks every child and every ptraced process, not just the one whose pid was supplied.
From `perf report` (there are 4 symmetric worker threads; the total counts are ~4x these):
```
-    7.21%     6.55%  24363  worker-3  [kernel.vmlinux]  [k] wait_consider_task
   - 6.55% __GI___clone (inlined)
        start_thread
        worker_run
      - event_execute
         - 6.54% process_continue
              thread_resume
              _thread_cleanupSysCallCondition (inlined)
            - threadptrace_resume
               - 6.41% _threadptrace_doAttach (inlined)
                    _threadptrace_nextChildState
                    __waitpid
                    entry_SYSCALL_64
                    do_syscall_64
                    __x64_sys_wait4
                    __do_sys_wait4
                  - kernel_wait4
                     - 6.06% do_wait
                          6.03% wait_consider_task
```
We spend less time here with cpu-pinning enabled. I suspect this is because do_wait actually does a semi-busy wait: it checks all children and then calls schedule. That is, it can get repeatedly woken up, find that there's still no child ready, and go back to sleep. With cpu-pinning enabled, the waiting thread can't be scheduled again (to do a futile linear scan) until the child has actually stopped and relinquished that cpu.
With cpu-pinning enabled:
```
-    0.75%     0.63%   1690  worker-1  [kernel.vmlinux]  [k] wait_consider_task
     0.63% __GI___clone (inlined)
        start_thread
        worker_run
        event_execute
        process_continue
        thread_resume
        _thread_cleanupSysCallCondition (inlined)
      - threadptrace_resume
         - 0.61% _threadptrace_doAttach (inlined)
              _threadptrace_nextChildState
              __waitpid
              entry_SYSCALL_64
              do_syscall_64
              __x64_sys_wait4
              __do_sys_wait4
            - kernel_wait4
               - 0.58% do_wait
                    0.57% wait_consider_task
+    0.72%     0.59%   1599  worker-3  [kernel.vmlinux]  [k] wait_consider_task
+    0.71%     0.59%   1599  worker-2  [kernel.vmlinux]  [k] wait_consider_task
+    0.71%     0.59%   1600  worker-0  [kernel.vmlinux]  [k] wait_consider_task
```
While these numbers are less worrisome, the above is all with 256 hosts (each with 1 simulated process). If we double the number of hosts doing the same amount of work, the number of calls to waitpid doubles, and the cost of each call doubles, since each call scans twice as many children. i.e. we're going to see quadratic growth here.
Experimentally, doubling the # of hosts to 512:
```
-    1.48%     1.31%   5469  worker-1  [kernel.vmlinux]  [k] wait_consider_task
     1.31% __GI___clone (inlined)
        start_thread
        worker_run
      - event_execute
         - 1.31% process_continue
              thread_resume
              _thread_cleanupSysCallCondition (inlined)
            - threadptrace_resume
               - 1.27% _threadptrace_doAttach (inlined)
                    _threadptrace_nextChildState
                    __waitpid
                    entry_SYSCALL_64
                    do_syscall_64
                    __x64_sys_wait4
                    __do_sys_wait4
                  - kernel_wait4
                     - 1.21% do_wait
                          1.21% wait_consider_task
+    1.43%     1.25%   5173  worker-0  [kernel.vmlinux]  [k] wait_consider_task
+    1.41%     1.23%   5146  worker-3  [kernel.vmlinux]  [k] wait_consider_task
+    1.34%     1.17%   4885  worker-2  [kernel.vmlinux]  [k] wait_consider_task
```
As expected, the percentages (cost relative to the rest of the simulation) roughly double, and the raw number of samples grows super-linearly (about 3.2x vs. the expected 4x).