Auto-port 5.0: IoUring: extend user data from short to long #16806

Merged: chrisvest merged 2 commits into 5.0 from auto-port-pr-16682-to-5.0 on May 13, 2026
@netty-project-bot (Contributor)

Auto-port of #16682 to 5.0
Cherry-picked commit: 7f00b24


Motivation:

This PR extends io_uring `userData` handling from `short` to `long`
without changing the existing fast path for short values.

We reuse Netty's `IoUringIoHandler` to drive some one-shot io_uring
operations through a shared `DefaultIoUringIoRegistration` per
`EventLoop`. In this model, `short` user data is too limited for real
usage: it is not enough for some tracking payloads and cannot reliably
carry values such as an `fd` or other larger identifiers.

Modification:

- Keep the existing packed fast path when `userData` still fits in
`short`.
- Add a slow path for larger `long userData` values.
- Track slow-path SQEs with a lightweight per-SQE table
(`PendingOpSlots`) and resolve completions through the live registration
table.
- Keep the io_uring channel code and `IoUringIoOps` path compatible with
long `userData`.
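
The fast-path / slow-path split above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `UserDataPackingSketch`, the bit layout, and the `SLOW_FLAG` marker bit are all assumptions made for the example. It only shows the idea that a value which fits in `short` is packed inline into the 64-bit io_uring `user_data`, while a larger value is replaced by a small slot index that points into a side table.

```java
// Hypothetical sketch: every name and the bit layout here are assumptions,
// not Netty's actual encoding.
final class UserDataPackingSketch {
    // Assumed marker bit telling the completion side to consult the slot table.
    private static final long SLOW_FLAG = 1L << 40;

    private UserDataPackingSketch() { }

    // True when the long value round-trips through a short without loss.
    static boolean fitsInShort(long userData) {
        return userData == (short) userData;
    }

    // Fast path: the short payload lives in the top 16 bits.
    static long encodeFast(int id, byte op, short data) {
        return ((long) data << 48) | ((op & 0xFFL) << 32) | (id & 0xFFFFFFFFL);
    }

    // Slow path: the top 16 bits carry a slot index (0..65535) instead.
    static long encodeSlow(int id, byte op, int slot) {
        if (slot < 0 || slot > 0xFFFF) {
            throw new IllegalArgumentException("slot out of range: " + slot);
        }
        return ((long) slot << 48) | SLOW_FLAG | ((op & 0xFFL) << 32) | (id & 0xFFFFFFFFL);
    }

    static int id(long packed) {
        return (int) packed;
    }

    static byte op(long packed) {
        return (byte) (packed >>> 32);
    }

    static boolean isSlow(long packed) {
        return (packed & SLOW_FLAG) != 0;
    }

    // Fast-path payload, sign-extended back to the original short.
    static short data(long packed) {
        return (short) (packed >>> 48);
    }

    // Slow-path slot index, read as an unsigned 16-bit value.
    static int slot(long packed) {
        return (int) (packed >>> 48);
    }
}
```

On completion, a decoder under these assumptions would check `isSlow(...)` first and read the payload directly only when the flag is clear.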

Result:
Custom `IoHandle` implementations can use `long` user data, the `short`
user data path keeps near-baseline performance, and the extra
bookkeeping cost is confined to the `long` user data slow path.

Design:
I also evaluated other tracking strategies, including open addressing
and `HashMap` / `LongObjectMap`-style mappings.

In practice, they were not a better fit for this workload:

- Open addressing with tombstones still introduced extra probe / insert
/ remove bookkeeping, and its CPU cost became more visible once removals
were frequent or the live set grew larger.
- `HashMap` / `LongObjectMap`-style solutions added extra lookup /
indirection overhead and extra GC overhead on the slow path, and were
not competitive enough for this use case.
- Some alternatives improved one side of the workload, but paid for it
either with higher steady-state CPU cost or with a more expensive remove
path.

The current approach is a better overall tradeoff for the target
scenario:

- custom `IoHandle` usage is relatively uncommon
- collisions are expected to be rare
- resizes should therefore also be uncommon
- for non-network io_uring operations, SQEs usually have a short pending
lifetime

That makes a simple array-backed per-SQE tracking scheme a good fit
here: it keeps the common case straightforward and avoids introducing
extra hot-path cost for more general but heavier data structures.
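
To make the array-backed idea concrete, here is a minimal sketch of a per-SQE slot table. It is not the `PendingOpSlots` implementation from this PR: the class name `PendingSlotsSketch`, its methods, and the free-list strategy for picking a slot are assumptions chosen for brevity, and the real table may assign and resolve slots differently.

```java
import java.util.Arrays;

// Hypothetical sketch of an array-backed slot table; not Netty's PendingOpSlots.
final class PendingSlotsSketch {
    private long[] values;     // long userData stored per slot
    private int[] freeStack;   // stack of currently free slot indices
    private int freeCount;     // number of valid entries in freeStack

    PendingSlotsSketch(int initialCapacity) {
        if (initialCapacity < 1) {
            throw new IllegalArgumentException("initialCapacity must be >= 1");
        }
        values = new long[initialCapacity];
        freeStack = new int[initialCapacity];
        // Push indices in reverse so that slot 0 is handed out first.
        for (int i = initialCapacity - 1; i >= 0; i--) {
            freeStack[freeCount++] = i;
        }
    }

    // Stores userData for a submitted SQE and returns the slot index.
    int acquire(long userData) {
        if (freeCount == 0) {
            grow();
        }
        int slot = freeStack[--freeCount];
        values[slot] = userData;
        return slot;
    }

    // Returns the stored userData and frees the slot when the CQE arrives.
    // The caller must release each slot exactly once; this sketch does not
    // guard against double release.
    long release(int slot) {
        long userData = values[slot];
        freeStack[freeCount++] = slot;
        return userData;
    }

    int capacity() {
        return values.length;
    }

    // Doubles the table. Called only when no slot is free, so the old
    // free stack is empty and does not need to be copied.
    private void grow() {
        int oldCapacity = values.length;
        int newCapacity = oldCapacity * 2;
        values = Arrays.copyOf(values, newCapacity);
        freeStack = new int[newCapacity];
        for (int i = newCapacity - 1; i >= oldCapacity; i--) {
            freeStack[freeCount++] = i;
        }
    }
}
```

In this sketch both `acquire` and `release` are a couple of array accesses with no hashing or boxing, and resizing only happens when every slot is in flight, which matches the assumption above that pending lifetimes are short.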

`CustomIoHandleBenchmark` on the current branch and 4.2 base:

Fast path vs baseline

| pendingOpsDepth | baseline fast | current fast | delta |
| --- | ---: | ---: | ---: |
| 4096 | 1,023,617 ops/s | 1,024,338 ops/s | +0.07% |
| 65536 | 974,757 ops/s | 970,427 ops/s | -0.44% |

Slow path vs current fast path

| pendingOpsDepth | fast path | slow path | delta |
| --- | ---: | ---: | ---: |
| 4096 | 1,024,338 ops/s | 944,886 ops/s | -7.76% |
| 65536 | 970,427 ops/s | 886,176 ops/s | -8.68% |

These numbers are in the expected range for the added slow-path
bookkeeping, while keeping the existing short-value fast path intact.
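
For reference, the delta columns are consistent with a plain relative difference against the left-hand throughput column. The small helper below, with the hypothetical name `ThroughputDelta`, recomputes them from the figures in the tables. It is only an arithmetic check and is not part of the benchmark.

```java
import java.util.Locale;

// Hypothetical helper that recomputes the delta columns from the tables.
final class ThroughputDelta {
    private ThroughputDelta() { }

    // Relative change of measured versus reference, in percent.
    static double deltaPercent(double reference, double measured) {
        return (measured - reference) / reference * 100.0;
    }

    public static void main(String[] args) {
        // Fast path vs baseline.
        System.out.printf(Locale.ROOT, "%.2f%%%n", deltaPercent(1_023_617, 1_024_338));
        System.out.printf(Locale.ROOT, "%.2f%%%n", deltaPercent(974_757, 970_427));
        // Slow path vs current fast path.
        System.out.printf(Locale.ROOT, "%.2f%%%n", deltaPercent(1_024_338, 944_886));
        System.out.printf(Locale.ROOT, "%.2f%%%n", deltaPercent(970_427, 886_176));
    }
}
```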

https://gist.github.com/dreamlike-ocean/05e7e272e0e6a9f45f40192229c938dc

fix #16634

---------

Co-authored-by: Chris Vest <christianvest_hansen@apple.com>
(cherry picked from commit 7f00b24)
@chrisvest chrisvest enabled auto-merge (squash) May 13, 2026 00:18
@chrisvest chrisvest merged commit 2c08424 into 5.0 May 13, 2026
31 of 33 checks passed
@chrisvest chrisvest deleted the auto-port-pr-16682-to-5.0 branch May 13, 2026 23:15