@withinboredom withinboredom commented Nov 23, 2025

This PR adds sync.Pool-based direct handoff to both regular and worker threads, building on the semaphore-based FIFO approach in fix/latency. The pool eliminates the remaining channel contention and provides significant tail latency improvements, particularly for workers.

By using sync.Pool, we get lock-free, per-P (per-processor) dispatch, which dramatically reduces contention. The dispatcher tries the pool first and falls back to the semaphore implementation when no idle thread is cached; a sketch of this shape is shown below.
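A minimal sketch of the pool-first, semaphore-fallback shape described above. This is illustrative only: identifiers like `idleThreads`, `thread`, `park`, and `dispatchViaSemaphore` are hypothetical names, not the actual FrankenPHP ones.

```go
// Hypothetical sketch: pool-first dispatch with a semaphore fallback.
// sync.Pool keeps parked threads in per-P caches, so a Get/Put pair on the
// same P avoids any global lock.
package dispatch

import "sync"

type request struct{ /* request payload elided */ }

type thread struct {
	work chan *request // handoff channel owned by the idle thread
}

var idleThreads sync.Pool // holds *thread values parked by idle threads

// dispatch hands r directly to a pooled idle thread when one is cached
// locally, and otherwise falls back to the semaphore-based FIFO path
// from fix/latency (not shown here).
func dispatch(r *request) {
	if t, ok := idleThreads.Get().(*thread); ok {
		t.work <- r // direct handoff on the fast path
		return
	}
	dispatchViaSemaphore(r)
}

// park registers the calling thread as idle. Pool entries may be dropped
// by the GC, which is fine: the fallback path still reaches the thread.
func (t *thread) park() {
	idleThreads.Put(t)
}

func dispatchViaSemaphore(r *request) { /* existing slow path */ }
```

The fast path here assumes the parked thread is already blocked receiving on its own `work` channel, so the send is an immediate handoff rather than a queue operation; the real implementation may wire this differently.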

Benchmark Results

Workers (8 threads, 500 connections, 15s)

| Configuration | Req/sec | P50 | P75 | P90 | P99 | Threads | Global RQ |
|---|---|---|---|---|---|---|---|
| latency branch | 78,161 | 6.03ms | 8.08ms | 11.19ms | 18.38ms | 45 | 145 |
| pool (this PR) | 76,259 | 6.06ms | 7.45ms | 9.03ms | 12.99ms | 46 | 217 |
| Improvement | -2.4% | +0.5% | -7.8% | -19.3% | -29.3% | +1 | +49.7% |

Regular Threads (8 threads, 500 connections, 15s)

| Configuration | Req/sec | P50 | P75 | P90 | P99 | Threads | Global RQ |
|---|---|---|---|---|---|---|---|
| latency branch | 42,096 | 11.46ms | 12.35ms | 13.35ms | 17.18ms | 62 | 3 |
| pool (this PR) | 42,592 | 11.40ms | 12.26ms | 13.27ms | 15.95ms | 67 | 11 |
| Improvement | +1.2% | -0.5% | -0.7% | -0.6% | -7.2% | +5 | +267% |

Low Load (10 connections, 1 thread, 15s) - regular threads only, included to show there is no low-load regression

| Configuration | Req/sec | P50 | P99 |
|---|---|---|---|
| baseline (main) | 38,354 | 222µs | 650µs |
| latency branch | 37,977 | 225µs | 657µs |
| pool (this PR) | 38,584 | 222µs | 643µs |
| Improvement (vs latency branch) | +1.6% | -1.3% | -2.1% |

Thread Affinity Tradeoff

Workers previously had a low-index affinity and would target the same threads under low load (t[0], t[1], etc.). This minimised resource initialisation when frameworks lazily create resources (e.g. database connections). The new behaviour in this PR relies on sync.Pool's per-P (per-processor) locality, which distributes requests across multiple threads even under low load.

I think this is actually a good thing, as the previous behaviour is actively dangerous in production scenarios.

Consider a scenario with 120 worker threads for an I/O-heavy workload (this is actually our advice in multiple issues). Under normal load, perhaps only t[0]-t[80] are ever active, so only ~80 database connections are ever opened. Then on Black Friday the load spikes, t[101] activates for the first time, and the connection count blows past a database limit of 100, causing cascading failures during peak load.

This is the worst possible time to discover resource limits.

With sync.Pool, a normal load eventually cycles through all 120 workers, so there are no surprises when load rises. It's also worth noting that with per-P locality there's a high probability that consecutive requests on the same connection land on the same worker, keeping the various caches (CPU L1/L2/L3, etc.) warm. This is probably better for cache efficiency than the old low-index affinity.
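A tiny, self-contained illustration of that per-P locality claim (not FrankenPHP code): a Put followed by a Get on the same P usually returns the very same object, which is why the same worker tends to be reused for back-to-back requests, while GC and migration still spread work across all workers over time.

```go
// Illustrative only: shows sync.Pool's per-P fast path handing back the
// object that was just Put, assuming no GC or goroutine migration occurs
// in between.
package main

import (
	"fmt"
	"sync"
)

type worker struct{ id int }

func main() {
	var idle sync.Pool

	w := &worker{id: 42}
	idle.Put(w)

	// The Get typically hits the current P's private slot first, so it
	// usually returns the exact worker we just parked.
	got, _ := idle.Get().(*worker)
	fmt.Println(got == w) // usually prints: true
}
```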

Further Improvements

Further improvements will result in diminishing returns. Based on eBPF profiling of the pool implementation under load, the remaining overhead is well-distributed:

Syscall profile:

  • futex: 902K calls, 211s total (significantly reduced from baseline's 1.1M+ calls)
  • sched_yield: 58K calls, 5.3s (intentional scheduler balancing, kept from earlier work)
  • File I/O (openat/newfstatat/close): ~2.8M operations, 11s (PHP script execution)
  • nanosleep: 177K calls, 47s (timing operations)

Off-CPU profile shows time is now spent primarily on:

  • PHP memory allocation (_emalloc_192) and hash operations (zend_hash_index_find)
  • Go work-stealing and scheduling (normal runtime overhead)

The Go-side dispatch optimisation in this PR has eliminated the primary bottleneck (futex contention on channels). The remaining time is spent on productive work (PHP execution) and unavoidable overhead (file I/O, timing). Future optimisation would need to focus on PHP internals, which are outside the scope of FrankenPHP’s Go dispatcher.

  • All benchmarks were run with significant tracing enabled, so the absolute numbers may be skewed compared to real workloads.

Signed-off-by: Robert Landers <landers.robert@gmail.com>
@withinboredom withinboredom (Member Author) commented:

merging to get tests to run 😂

@withinboredom withinboredom merged commit 9822b38 into fix/latency Nov 23, 2025
@withinboredom withinboredom deleted the pools branch November 23, 2025 19:39