waitpid is O(# children), resulting in O((# sim processes)^2) in thread_ptrace #1134
Description
Broken out from #1041
Using a syscall-heavy microbenchmark, I did some perf-based profiling of ptrace mode.
Without cpu-pinning, we spend 20% of our time in wait_consider_task inside the kernel. It turns out that waitpid is implemented via the wait4 syscall, which lands in the kernel's do_wait, which checks every child and every ptraced process, not just the one whose pid was supplied.
From `perf report` (there are 4 symmetric worker threads; the total counts are ~4x these):
```
-    7.21%     6.55%  24363  worker-3  [kernel.vmlinux]  [k] wait_consider_task
   - 6.55% __GI___clone (inlined)
        start_thread
        worker_run
      - event_execute
         - 6.54% process_continue
              thread_resume
              _thread_cleanupSysCallCondition (inlined)
            - threadptrace_resume
               - 6.41% _threadptrace_doAttach (inlined)
                    _threadptrace_nextChildState
                    __waitpid
                    entry_SYSCALL_64
                    do_syscall_64
                    __x64_sys_wait4
                    __do_sys_wait4
                  - kernel_wait4
                     - 6.06% do_wait
                          6.03% wait_consider_task
```
We spend less time here with cpu-pinning enabled. I suspect this is because do_wait actually does a semi-busy wait: it checks all children and then calls schedule. That is, it can get repeatedly woken up, find that there's still no child ready, and go back to sleep. With cpu-pinning enabled, the waiting thread can't be scheduled again (to do a futile linear scan) until the child has actually stopped and relinquished that cpu.
With cpu-pinning enabled:
```
-    0.75%     0.63%   1690  worker-1  [kernel.vmlinux]  [k] wait_consider_task
     0.63% __GI___clone (inlined)
        start_thread
        worker_run
        event_execute
        process_continue
        thread_resume
        _thread_cleanupSysCallCondition (inlined)
      - threadptrace_resume
         - 0.61% _threadptrace_doAttach (inlined)
              _threadptrace_nextChildState
              __waitpid
              entry_SYSCALL_64
              do_syscall_64
              __x64_sys_wait4
              __do_sys_wait4
            - kernel_wait4
               - 0.58% do_wait
                    0.57% wait_consider_task
+    0.72%     0.59%   1599  worker-3  [kernel.vmlinux]  [k] wait_consider_task
+    0.71%     0.59%   1599  worker-2  [kernel.vmlinux]  [k] wait_consider_task
+    0.71%     0.59%   1600  worker-0  [kernel.vmlinux]  [k] wait_consider_task
```
While these numbers are less worrisome, the above is all with 256 hosts (each with 1 simulated process). If we double the number of hosts doing the same amount of work, the number of calls to waitpid doubles, and the cost of each call doubles, since each call scans twice as many children. i.e. we're going to see quadratic growth here.
Experimentally, doubling the # of hosts to 512:
```
-    1.48%     1.31%   5469  worker-1  [kernel.vmlinux]  [k] wait_consider_task
     1.31% __GI___clone (inlined)
        start_thread
        worker_run
      - event_execute
         - 1.31% process_continue
              thread_resume
              _thread_cleanupSysCallCondition (inlined)
            - threadptrace_resume
               - 1.27% _threadptrace_doAttach (inlined)
                    _threadptrace_nextChildState
                    __waitpid
                    entry_SYSCALL_64
                    do_syscall_64
                    __x64_sys_wait4
                    __do_sys_wait4
                  - kernel_wait4
                     - 1.21% do_wait
                          1.21% wait_consider_task
+    1.43%     1.25%   5173  worker-0  [kernel.vmlinux]  [k] wait_consider_task
+    1.41%     1.23%   5146  worker-3  [kernel.vmlinux]  [k] wait_consider_task
+    1.34%     1.17%   4885  worker-2  [kernel.vmlinux]  [k] wait_consider_task
```
As expected, the percentages (cost relative to the rest of the simulation) roughly double, and the raw number of samples grows super-linearly (about 3.2x vs. the expected 4x).