Skip to content

Eliminate redundant bounds checks in CompositeByteBuf accessors#16525

Merged
normanmaurer merged 2 commits intonetty:4.2from
franz1981:composite_opt
Mar 24, 2026
Merged

Eliminate redundant bounds checks in CompositeByteBuf accessors#16525
normanmaurer merged 2 commits intonetty:4.2from
franz1981:composite_opt

Conversation

@franz1981
Copy link
Copy Markdown
Contributor

Motivation:

Every _getXxx/_setXxx in CompositeByteBuf delegates to the underlying buffer's public API (e.g. c.buf.getByte()), which re-checks bounds that the composite level already validated.

Modifications:

  • Add an abuf field to Component that caches (AbstractByteBuf) buf at construction time, or null for non-AbstractByteBuf wrappers
  • Use abuf._getXxx() (no bounds check) when available, fall back to buf.getXxx() otherwise
  • Add sequentialReadBytes and sequentialGetBytes JMH benchmarks

Result:

Redundant bounds checks eliminated on the hot path.

@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Mar 18, 2026

@normanmaurer

  Benchmark                              (bufferType)  (size)  Mode  Cnt      Baseline (±err)       Commit1 (±err)   Ratio

  === Sequential ===
  sequentialGetBytes                     SMALL_CHUNKS   10240  thrpt   5   23563.32 ±  861.85   24996.74 ± 2834.19   1.06x
  sequentialGetBytes                     LARGE_CHUNKS   10240  thrpt   5   48116.85 ± 2396.69   56202.34 ±  674.51   1.17x
  sequentialGetBytes              SMALL_CHUNKS_DIRECT   10240  thrpt   5   20882.80 ± 3327.20   21529.90 ± 3040.57   1.03x
  sequentialGetBytes              LARGE_CHUNKS_DIRECT   10240  thrpt   5   38989.21 ±  771.70   44600.55 ±  546.00   1.14x

  sequentialReadBytes                    SMALL_CHUNKS   10240  thrpt   5   23030.90 ± 1182.36   24422.75 ±  859.11   1.06x
  sequentialReadBytes                    LARGE_CHUNKS   10240  thrpt   5   42214.83 ±  190.90   47075.50 ±  642.57   1.12x
  sequentialReadBytes             SMALL_CHUNKS_DIRECT   10240  thrpt   5   20359.31 ±  981.67   21145.33 ± 1250.80   1.04x
  sequentialReadBytes             LARGE_CHUNKS_DIRECT   10240  thrpt   5   34738.97 ±  263.33   37685.10 ±  313.38   1.08x

  === Random Access ===
  setGetLong                             SMALL_CHUNKS   10240  thrpt   5  3318081.9 ± 360816.0  3072150.6 ± 360034.8   0.93x
  setGetLong                             LARGE_CHUNKS   10240  thrpt   5 43629810.9 ± 433517.2 43819842.9 ± 240070.8   1.00x
  setLong                                SMALL_CHUNKS   10240  thrpt   5  5604152.6 ±  24458.0  6636585.7 ±  24754.9   1.18x
  setLong                                LARGE_CHUNKS   10240  thrpt   5 45805877.1 ± 116882.6 47834277.2 ± 431502.9   1.04x

Here we go:

  • LARGE_CHUNKS benefit most (+12-17%) — fewer component crossings means more time spent in the accessor itself
  • SMALL_CHUNKS benefit less (+3-6%) — binary search dominates, so the bounds check savings are diluted
  • Direct is consistently slower than heap (~15-20%) — that's the JDK 25 MemorySegment overhead on direct buffer access

@franz1981
Copy link
Copy Markdown
Contributor Author

@yawkat It's a tiny improvement, but for free, basically ^^
I'm checking for other low hanging fruits in assembly

Motivation:

Every _getXxx/_setXxx in CompositeByteBuf delegates to the underlying
buffer's public API (e.g. c.buf.getByte()), which re-checks bounds
that the composite level already validated.

Additionally, sequential readByte() calls pay a binary search cost on
every component boundary crossing, since findComponent0 only caches
the last accessed Component reference without its array index.

Modifications:

- Add an `abuf` field to Component that caches `(AbstractByteBuf) buf`
  at construction time, or null for non-AbstractByteBuf wrappers
- Use abuf._getXxx() (no bounds check) when available, fall back to
  buf.getXxx() otherwise
- Add sequentialReadBytes and sequentialGetBytes JMH benchmarks

Result:

Redundant bounds checks eliminated on the hot path.
@franz1981
Copy link
Copy Markdown
Contributor Author

franz1981 commented Mar 18, 2026

Here we go, another round of improvements - this time for readByte only TBH:

  Benchmark                              (bufferType)  (size)    Baseline    Commit1   Commit1+2   C1+2/BL

  sequentialReadBytes                    SMALL_CHUNKS   10240    23031       24423      37012      1.61x  !!!
  sequentialReadBytes                    LARGE_CHUNKS   10240    42215       47076      48356      1.15x
  sequentialReadBytes             SMALL_CHUNKS_DIRECT   10240    20359       21145      31411      1.54x  !!!
  sequentialReadBytes             LARGE_CHUNKS_DIRECT   10240    20359       37685      38768      1.12x

  sequentialGetBytes                     SMALL_CHUNKS   10240    23563       24997      24215      1.03x  (unchanged, expected)
  sequentialGetBytes                     LARGE_CHUNKS   10240    48117       56202      56520      1.17x  (unchanged, expected)

multi-byte requires a bit more work so probably not worthy.


// weak cache - check it first when looking for component
private Component lastAccessed;
private int lastAccessedIndex;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JOL says this will fix into existing wasted space, so basically free ^^

this.srcBuf = srcBuf;
this.srcAdjustment = srcOffset - offset;
this.buf = buf;
this.abuf = buf instanceof AbstractByteBuf ? (AbstractByteBuf) buf : null;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is going to use the most optimized bound-check free version of each getXYZ access

@franz1981
Copy link
Copy Markdown
Contributor Author

@normanmaurer ready to go for me: I don't plan to put much more efforts here yet, as with minimal changes I see some measurable improvement, so, good ROI

Motivation:

Sequential readByte() calls inherited from AbstractByteBuf pay a
binary search cost on every component boundary crossing. For composites
with many small components (e.g. decompressor output), this is the
dominant cost — async-profiler shows 43% of CPU in findIt().

Modifications:

- Add lastAccessedIndex field to track the array position of the
  cached component
- Override readByte() to use lastAccessed cache directly, skipping
  the findComponent0 indirection
- Add findComponentForRead() that advances to the next non-empty
  component in O(1) instead of binary search, falling back to
  findIt() only on random jumps
- Add direct buffer variants to the sequential benchmark

Result:

Sequential readByte() throughput improved by up to 61% for small-chunk
composites (SMALL_CHUNKS 10240, JDK 25).
@franz1981
Copy link
Copy Markdown
Contributor Author

PTAL @normanmaurer this should be enough for now. Mostly low hanging fruits ;)

@chrisvest chrisvest added this to the 4.2.11.Final milestone Mar 24, 2026
@chrisvest chrisvest added the needs-cherry-pick-5.0 This PR should be cherry-picked to 5.0 once merged. label Mar 24, 2026
@normanmaurer normanmaurer merged commit 6001499 into netty:4.2 Mar 24, 2026
33 of 37 checks passed
netty-project-bot pushed a commit that referenced this pull request Mar 24, 2026
Motivation:

Every _getXxx/_setXxx in CompositeByteBuf delegates to the underlying
buffer's public API (e.g. c.buf.getByte()), which re-checks bounds that
the composite level already validated.

Modifications:

- Add an `abuf` field to Component that caches `(AbstractByteBuf) buf`
at construction time, or null for non-AbstractByteBuf wrappers
- Use abuf._getXxx() (no bounds check) when available, fall back to
buf.getXxx() otherwise
- Add sequentialReadBytes and sequentialGetBytes JMH benchmarks

Result:

Redundant bounds checks eliminated on the hot path.

(cherry picked from commit 6001499)
@netty-project-bot
Copy link
Copy Markdown
Contributor

Auto-port PR for 5.0: #16534

@github-actions github-actions bot removed the needs-cherry-pick-5.0 This PR should be cherry-picked to 5.0 once merged. label Mar 24, 2026
normanmaurer pushed a commit that referenced this pull request Mar 24, 2026
…accessors (#16534)

Auto-port of #16525 to 5.0
Cherry-picked commit: 6001499

---
Motivation:

Every _getXxx/_setXxx in CompositeByteBuf delegates to the underlying
buffer's public API (e.g. c.buf.getByte()), which re-checks bounds that
the composite level already validated.

Modifications:

- Add an `abuf` field to Component that caches `(AbstractByteBuf) buf`
at construction time, or null for non-AbstractByteBuf wrappers
- Use abuf._getXxx() (no bounds check) when available, fall back to
buf.getXxx() otherwise
- Add sequentialReadBytes and sequentialGetBytes JMH benchmarks

Result:

Redundant bounds checks eliminated on the hot path.

Co-authored-by: Francesco Nigro <nigro.fra@gmail.com>
chrisvest added a commit that referenced this pull request Mar 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants