Replace ConditionLock with wake-one signalling NIOThreadPoolWorkAvailable #3507
Merged
Lukasa merged 4 commits into apple:main from … on Feb 10, 2026
Motivation
----------

`NIOThreadPool` had no benchmarks measuring submit overhead. This makes it difficult to evaluate the cost of signalling changes or to catch latency regressions.

Modifications
-------------

Add thread pool submit benchmarks, covering use cases with 4-thread and 16-thread pools.

Result
------

`NIOThreadPool` submit throughput and context-switch overhead are now tracked by benchmarks.

Benchmark Results
-----------------

NIOThreadPool.serial_wakeup(16 threads)

| Metric | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| Context switches (K) | 77 | 78 | 79 | 80 | 80 | 92 | 92 | 30 |
| Syscalls (total) (K) * | 106 | 107 | 108 | 109 | 110 | 116 | 116 | 30 |
| Time (system CPU) (ms) * | 1649 | 1752 | 1768 | 1795 | 1826 | 1929 | 1929 | 30 |
| Time (total CPU) (ms) * | 1701 | 1805 | 1821 | 1849 | 1879 | 1987 | 1987 | 30 |
| Time (user CPU) (ms) * | 51 | 52 | 52 | 53 | 53 | 57 | 57 | 30 |
| Time (wall clock) (ms) * | 167 | 177 | 178 | 181 | 183 | 200 | 200 | 30 |

NIOThreadPool.serial_wakeup(4 threads)

| Metric | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| Context switches (K) | 44 | 44 | 44 | 45 | 45 | 45 | 45 | 30 |
| Syscalls (total) (K) * | 65 | 65 | 65 | 66 | 66 | 67 | 67 | 30 |
| Time (system CPU) (ms) * | 159 | 162 | 163 | 165 | 166 | 169 | 169 | 30 |
| Time (total CPU) (ms) * | 178 | 182 | 183 | 185 | 186 | 190 | 190 | 30 |
| Time (user CPU) (ms) * | 19 | 19 | 20 | 20 | 20 | 21 | 21 | 30 |
| Time (wall clock) (ms) * | 76 | 79 | 79 | 80 | 80 | 82 | 82 | 30 |
1e24ada to 24207d7
…ailable`

Motivation
----------

`NIOThreadPool` used `ConditionLock`, which calls `pthread_cond_broadcast` on every state change, waking all threads when only one work item is enqueued. This causes a thundering-herd problem.

Modifications
-------------

Add `NIOThreadPoolWorkAvailable` in NIOConcurrencyHelpers that uses `pthread_cond_signal` (wake-one) for work submission and `pthread_cond_broadcast` only for shutdown. Replace `ConditionLock<_WorkState>` and the `_WorkState` enum in `NIOThreadPool` with this new primitive.

Result
------

Submitting a work item wakes exactly **one** thread instead of all threads.

Benchmark Results
-----------------

NIOThreadPool.serial_wakeup(16 threads)

| Metric | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| Context switches (K) | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 30 |
| Syscalls (total) (K) * | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 30 |
| Time (system CPU) (ms) * | 47 | 49 | 49 | 50 | 50 | 55 | 55 | 30 |
| Time (total CPU) (ms) * | 57 | 58 | 59 | 60 | 61 | 67 | 67 | 30 |
| Time (user CPU) (ms) * | 10 | 10 | 10 | 10 | 10 | 12 | 12 | 30 |
| Time (wall clock) (ms) * | 54 | 55 | 56 | 56 | 57 | 65 | 65 | 30 |

NIOThreadPool.serial_wakeup(4 threads)

| Metric | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| Context switches (K) | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 30 |
| Syscalls (total) (K) * | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 30 |
| Time (system CPU) (ms) * | 45 | 46 | 46 | 47 | 57 | 75 | 75 | 30 |
| Time (total CPU) (ms) * | 54 | 55 | 56 | 57 | 68 | 87 | 87 | 30 |
| Time (user CPU) (μs) * | 9055 | 9372 | 9478 | 9765 | 10887 | 12585 | 12585 | 30 |
| Time (wall clock) (ms) * | 52 | 53 | 53 | 54 | 64 | 124 | 124 | 30 |
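A minimal sketch of the wake-one pattern described in this commit, assuming Foundation's `NSCondition` as a stand-in for the raw pthread calls: its `signal()` wakes one waiting thread (the same semantics as `pthread_cond_signal`) and its `broadcast()` wakes all waiting threads (like `pthread_cond_broadcast`). `WakeOneWorkQueue`, `submit`, `next`, and `shutdown` are hypothetical names; this is not the implementation of `NIOThreadPoolWorkAvailable`. The driver loop submits items one at a time and waits for each to finish, which is only an assumed approximation of what `serial_wakeup` measures.

```swift
import Foundation
import Dispatch

/// Hypothetical wake-one work queue (illustrative, not NIO API).
final class WakeOneWorkQueue: @unchecked Sendable {
    private let condition = NSCondition()
    private var items: [() -> Void] = []
    private var isShuttingDown = false

    /// Enqueue one item and wake at most one waiting worker.
    func submit(_ item: @escaping () -> Void) {
        condition.lock()
        items.append(item)
        condition.signal()      // wake-one, like pthread_cond_signal
        condition.unlock()
    }

    /// Wake every worker so each can observe shutdown and exit.
    func shutdown() {
        condition.lock()
        isShuttingDown = true
        condition.broadcast()   // wake-all is used only here
        condition.unlock()
    }

    /// Block until an item is available; return nil once shut down and drained.
    func next() -> (() -> Void)? {
        condition.lock()
        defer { condition.unlock() }
        // Re-check in a loop: tolerates spurious wakeups and the case where
        // another worker took the item before this one reacquired the lock.
        while items.isEmpty && !isShuttingDown {
            condition.wait()
        }
        if items.isEmpty { return nil }
        return items.removeFirst()
    }
}

// Driver (assumed workload): 4 workers, 1,000 serial submit-and-wait rounds.
let queue = WakeOneWorkQueue()
let workersExited = DispatchGroup()
for _ in 0..<4 {
    workersExited.enter()
    Thread {
        while let item = queue.next() { item() }
        workersExited.leave()
    }.start()
}

let itemFinished = DispatchSemaphore(value: 0)
let counterLock = NSLock()
var processed = 0
for _ in 0..<1_000 {
    queue.submit {
        counterLock.lock()
        processed += 1
        counterLock.unlock()
        itemFinished.signal()
    }
    itemFinished.wait()   // wait for this item before submitting the next
}
queue.shutdown()
workersExited.wait()      // every worker has returned from its loop
```

Because a woken worker re-checks `items` under the lock before taking one, a signal that arrives after another worker has already taken the item is harmless: the woken worker simply waits again. `broadcast()` is kept for shutdown, where every worker must observe `isShuttingDown` and exit.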
24207d7 to 3b2e685
Contributor
Author
Running
| Benchmark | Before (avg) | After (avg) | Improvement |
|---|---|---|---|
| 4 threads, 10k tasks | ~79ms | ~53ms | 1.5x faster (33% reduction) |
| 16 threads, 10k tasks | ~165ms | ~57ms | 2.9x faster (66% reduction) |
KushalP commented Feb 10, 2026
Replace `inout Int` closure parameters with returned deltas to eliminate heap allocation of mutable captured state. Each `inout` parameter allocates a box on the heap to allow the closure to mutate the referenced value, adding one allocation per thread pool operation.
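The comment above describes replacing `inout Int` closure parameters with returned deltas. The sketch below shows only that change in closure shape, using a hypothetical `IdleCounter` guarded by Foundation's `NSLock`; `IdleCounter`, `updateInout`, and `updateReturningDelta` are illustrative names, not NIO API, and the sketch does not demonstrate or measure the heap allocation the comment refers to.

```swift
import Foundation

/// Hypothetical lock-protected counter (illustrative, not NIO API).
final class IdleCounter: @unchecked Sendable {
    private let lock = NSLock()
    private var idleThreads = 0

    // Before: the closure mutates the count through an `inout Int` parameter.
    func updateInout(_ body: (inout Int) -> Void) {
        lock.lock()
        defer { lock.unlock() }
        body(&idleThreads)
    }

    // After: the closure receives the current value and returns a delta,
    // which this method applies while still holding the lock.
    func updateReturningDelta(_ body: (Int) -> Int) {
        lock.lock()
        defer { lock.unlock() }
        idleThreads += body(idleThreads)
    }

    var current: Int {
        lock.lock()
        defer { lock.unlock() }
        return idleThreads
    }
}

let counter = IdleCounter()
counter.updateInout { $0 += 1 }                          // old shape: 0 -> 1
counter.updateReturningDelta { _ in 1 }                  // new shape: 1 -> 2
counter.updateReturningDelta { n in n > 0 ? -1 : 0 }     // conditional decrement: 2 -> 1
```

With the returned-delta shape the closure never holds a reference to the protected storage; it only computes a value, and the owner of the lock decides how to apply it.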
TL;DR: This change leads to a ~90% reduction in observed system CPU time for some use cases by waking a single thread instead of all idle threads.
Changes
Inlining the commit messages here.
Add NIOThreadPool submit throughput benchmarks
Motivation
`NIOThreadPool` had no benchmarks measuring submit overhead. This makes it difficult to evaluate the cost of signalling changes or to catch latency regressions.
Modifications
Add thread pool submit benchmarks, covering use cases with 4-thread and 16-thread pools.
Result
`NIOThreadPool` submit throughput and context-switch overhead are now tracked by benchmarks.
Benchmark Results
Replace `ConditionLock` with wake-one signalling `NIOThreadPoolWorkAvailable`
Motivation
`NIOThreadPool` used `ConditionLock`, which calls `pthread_cond_broadcast` on every state change, waking all threads when only one work item is enqueued. This causes a thundering-herd problem.
Modifications
Add `NIOThreadPoolWorkAvailable` in NIOConcurrencyHelpers that uses `pthread_cond_signal` (wake-one) for work submission and `pthread_cond_broadcast` only for shutdown. Replace `ConditionLock<_WorkState>` and the `_WorkState` enum in `NIOThreadPool` with this new primitive.
Result
Submitting a work item wakes exactly one thread instead of all threads.
Benchmark Results