waitpid is O(# children), resulting in O((# sim processes)^2) in thread_ptrace #1134

@sporksmith

Description
Broken out from #1041

Using a syscall-heavy microbenchmark, I did some perf-based profiling of ptrace mode.

Without cpu-pinning, we spend 20% of our time in wait_consider_task inside the kernel. It turns out that waitpid is implemented via the wait4 syscall, which ends up in do_wait, which checks every child and every ptraced process, not just the one whose pid was supplied.

From perf report (there are 4 symmetric worker threads; the total counts are ~4x these):

```
-    7.21%     6.55%         24363  worker-3  [kernel.vmlinux]  [k] wait_consider_task
   - 6.55% __GI___clone (inlined)
        start_thread
        worker_run
      - event_execute
         - 6.54% process_continue
              thread_resume
              _thread_cleanupSysCallCondition (inlined)
            - threadptrace_resume
               - 6.41% _threadptrace_doAttach (inlined)
                    _threadptrace_nextChildState
                    __waitpid
                    entry_SYSCALL_64
                    do_syscall_64
                    __x64_sys_wait4
                    __do_sys_wait4
                  - kernel_wait4
                     - 6.06% do_wait
                          6.03% wait_consider_task
```

We spend less time here with cpu-pinning enabled. I suspect this is because do_wait actually does a semi-busy wait: it checks all children and then calls schedule, i.e. the waiting thread can get repeatedly woken up, find there's still no child ready, and go back to sleep. With cpu-pinning enabled, the waiting thread can't be scheduled again (to do a futile linear scan) until the child has actually stopped and relinquished that cpu.

With cpu-pinning enabled:

```
-    0.75%     0.63%          1690  worker-1  [kernel.vmlinux]  [k] wait_consider_task
     0.63% __GI___clone (inlined)
        start_thread
        worker_run
        event_execute
        process_continue
        thread_resume
        _thread_cleanupSysCallCondition (inlined)
      - threadptrace_resume
         - 0.61% _threadptrace_doAttach (inlined)
              _threadptrace_nextChildState
              __waitpid
              entry_SYSCALL_64
              do_syscall_64
              __x64_sys_wait4
              __do_sys_wait4
            - kernel_wait4
               - 0.58% do_wait
                    0.57% wait_consider_task
+    0.72%     0.59%          1599  worker-3  [kernel.vmlinux]  [k] wait_consider_task
+    0.71%     0.59%          1599  worker-2  [kernel.vmlinux]  [k] wait_consider_task
+    0.71%     0.59%          1600  worker-0  [kernel.vmlinux]  [k] wait_consider_task
```

While these numbers are less worrisome, the above is all with 256 hosts (each with 1 simulated process). If we double the number of hosts doing the same amount of work, the number of waitpid calls doubles and the cost of each call doubles, i.e. we should see quadratic growth here.

Experimentally, doubling the # of hosts to 512:

```
-    1.48%     1.31%          5469  worker-1  [kernel.vmlinux]  [k] wait_consider_task
     1.31% __GI___clone (inlined)
        start_thread
        worker_run
      - event_execute
         - 1.31% process_continue
              thread_resume
              _thread_cleanupSysCallCondition (inlined)
            - threadptrace_resume
               - 1.27% _threadptrace_doAttach (inlined)
                    _threadptrace_nextChildState
                    __waitpid
                    entry_SYSCALL_64
                    do_syscall_64
                    __x64_sys_wait4
                    __do_sys_wait4
                  - kernel_wait4
                     - 1.21% do_wait
                          1.21% wait_consider_task
+    1.43%     1.25%          5173  worker-0  [kernel.vmlinux]  [k] wait_consider_task
+    1.41%     1.23%          5146  worker-3  [kernel.vmlinux]  [k] wait_consider_task
+    1.34%     1.17%          4885  worker-2  [kernel.vmlinux]  [k] wait_consider_task
```

As expected, the percentages (cost relative to the rest of the simulation) double, and the raw number of samples grows super-linearly (about 3.2x vs the expected 4x).


Labels: Component: Main, Priority: High, Tag: Performance, Type: Bug, Type: Enhancement
