Skip to content

Remove reference counting from size classed chunks#16306

Merged
chrisvest merged 9 commits into
netty:4.2from
franz1981:4.2_rm_rf_cnt_chunks
Feb 26, 2026
Merged

Remove reference counting from size classed chunks#16306
chrisvest merged 9 commits into
netty:4.2from
franz1981:4.2_rm_rf_cnt_chunks

Conversation

@franz1981

Copy link
Copy Markdown
Contributor

Motivation:

SizeClassedChunk performs 2 atomic ops (retain/release) per allocation cycle on the hot path.

Modification:

Replace ref counting with a segment-count state machine that only needs atomics on the cold deallocation path.

Result:

No more per-allocation atomic operations for SizeClassedChunk.

@franz1981

Copy link
Copy Markdown
Contributor Author

This is a change coming from #15741 @ 8953bbe

@franz1981

franz1981 commented Feb 19, 2026

Copy link
Copy Markdown
Contributor Author

FYI @laosijikaichele this should deliver quite a decent boost 🗡️
in pretty much ALL cases, non thread locals and thread locals in particular ^^

@chrisvest sadly I have left the ref cnt to still be allocated per chunk, due to other optimizations, but it shouldn't be used

@franz1981

Copy link
Copy Markdown
Contributor Author

This is ready to go @normanmaurer in term of perf numbers I have already performed tests a lot for other prs on this, and the numbers were showing 1.5X or 2.X better perf for the thread local allocations

@franz1981 franz1981 force-pushed the 4.2_rm_rf_cnt_chunks branch from 0eefff6 to 842abd9 Compare February 19, 2026 09:41
@normanmaurer normanmaurer added this to the 4.2.11.Final milestone Feb 19, 2026
@franz1981

Copy link
Copy Markdown
Contributor Author

@laosijikaichele I will check if using your perf test on mimalloc, but sadly we don't have the same HW so I will likely get some different numbers 😢
But last time I tried this, it was speeding things up a lot

@laosijikaichele

Copy link
Copy Markdown
Contributor

@franz1981 great work, no worries about the hardware, we can mainly compare the number change on adaptive, before and after this PR, to see the improvement.

@chrisvest chrisvest added needs-cherry-pick-4.1 This PR should be cherry-picked to 4.1 once merged. needs-cherry-pick-5.0 This PR should be cherry-picked to 5.0 once merged. labels Feb 21, 2026
@normanmaurer

Copy link
Copy Markdown
Member

@chrisvest @laosijikaichele @yawkat PTAL

Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java
Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java
Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
@franz1981 franz1981 force-pushed the 4.2_rm_rf_cnt_chunks branch from e5751d6 to 21f7420 Compare February 23, 2026 15:06
@franz1981

Copy link
Copy Markdown
Contributor Author

PTAL @normanmaurer I have simplified the capacity method to make it inlineable, but TBH it's the logic there which concern me, since mpsc::size is not "trivial" (is a loop!)

@franz1981

Copy link
Copy Markdown
Contributor Author

@laosijikaichele I was wrong, it seems to improve, on my x86 by a mere ~7-8% atm, maybe you will have better luck with aarch64?
Now the most relevant cost is the single buffer release :"( (due to the new RefCnt indirection and VarHandle usage)

@franz1981

Copy link
Copy Markdown
Contributor Author

When i see things like these my heart is bleeding 😢

   0.10%    0x00007f1ddc2103cb:   jg     0x00007f1ddc21118c           ;*if_icmpgt {reexecute=0 rethrow=0 return_oop=0}
                                                                      ; - io.netty.buffer.AdaptivePoolingAllocator::allocate@7 (line 260)
            0x00007f1ddc2103d1:   lea    0x1f(%rdx),%r8d
   4.09%    0x00007f1ddc2103d5:   sar    $0x5,%r8d                    ;*ishr {reexecute=0 rethrow=0 return_oop=0}
                                                                      ; - io.netty.buffer.AdaptivePoolingAllocator::sizeIndexOf@5 (line 282)
                                                                      ; - io.netty.buffer.AdaptivePoolingAllocator::sizeClassIndexOf@1 (line 286)
                                                                      ; - io.netty.buffer.AdaptivePoolingAllocator::allocate@11 (line 261)

I know that is a data dep but, c'mon...on Intel this the lea is faster...

@franz1981 franz1981 force-pushed the 4.2_rm_rf_cnt_chunks branch 2 times, most recently from dfb6661 to 674ce6b Compare February 23, 2026 15:50
@franz1981

Copy link
Copy Markdown
Contributor Author

@laosijikaichele i would be curious how the numbers now look vs mimalloc to understand the gap left (assuming proper sized chunk q 😢)

chrisvest
chrisvest previously approved these changes Feb 23, 2026

@chrisvest chrisvest left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit but overall this looks good.

Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java
@franz1981 franz1981 force-pushed the 4.2_rm_rf_cnt_chunks branch from 674ce6b to 81de40e Compare February 23, 2026 20:10
@franz1981

franz1981 commented Feb 23, 2026

Copy link
Copy Markdown
Contributor Author

Results on my machine
without thread local accesses

  ┌────────────────────────┬──────────────────────┬──────────────────────┬───────┐
  │        Scenario        │  Before (77e2a683)   │   After (6975d4e7)   │ Delta │
  ├────────────────────────┼──────────────────────┼──────────────────────┼───────┤
  │ Default                │ 63.368 ± 1.487 ns/op │ 60.635 ± 0.900 ns/op │ -4.3% │
  ├────────────────────────┼──────────────────────┼──────────────────────┼───────┤
  │ --enable-native-access │ 62.687 ± 0.828 ns/op │ 59.861 ± 0.240 ns/op │ -4.5% │
  └────────────────────────┴──────────────────────┴──────────────────────┴───────┘

and thread local one

  ┌───────────────────────────────────────────┬──────────────────────┬──────────────────────┬────────┐
  │                 Scenario                  │  Before (77e2a683)   │   After (6975d4e7)   │ Delta  │
  ├───────────────────────────────────────────┼──────────────────────┼──────────────────────┼────────┤
  │ Custom executor                           │ 37.993 ± 0.762 ns/op │ 35.543 ± 0.936 ns/op │ -6.4%  │
  ├───────────────────────────────────────────┼──────────────────────┼──────────────────────┼────────┤
  │ Custom executor + native-access           │ 39.102 ± 0.503 ns/op │ 34.715 ± 0.704 ns/op │ -11.2% │
  └───────────────────────────────────────────┴──────────────────────┴──────────────────────┴────────┘

Not wow, but not terrible.
Let's say the most of the improvements were already rolled out in the previous PR i made@ #15741
Would be nice to see how much aarch64 benefit from removing atomic ops in the hot path (twice atomic ops)

Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
@franz1981 franz1981 force-pushed the 4.2_rm_rf_cnt_chunks branch from d0afc55 to d921c93 Compare February 23, 2026 20:12
@franz1981

Copy link
Copy Markdown
Contributor Author

And the other hidden performance benefit is that we now have scalable allocate/release from different threads - because it will scale as much as the underlying mpsc queue, while before it was "limited" by the shared "chunk" ref cnt. It's a pathological case, but still, worth mentioning.

Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
@franz1981

Copy link
Copy Markdown
Contributor Author

PTAL @normanmaurer if you're happy with the first numbers at #16306 (comment)

@chrisvest chrisvest dismissed their stale review February 23, 2026 22:47

Code was updated

@chrisvest

Copy link
Copy Markdown
Member

@chrisvest in relation of your request to turn some release into mark to deallocate I'll gently push back: these three release() calls are all refcount-related operations (undoing a prior retain()), not "I'm done with this chunk" signals. Changing them to markToDeallocate() would be semantically misleading IMHO 🙏

Fair enough 👍

@chrisvest

Copy link
Copy Markdown
Member

Revert the io_uring changes. We can rebase on #16359 when it goes in.

The builds got cancelled after 6 hours and didn't produce any logs, which is unfortunate as we don't know if it's the allocator changes or the io_uring changes that made them get stuck.

Motivation:

SizeClassedChunk performs 2 atomic ops (retain/release) per allocation
cycle on the hot path.

Modification:

Replace ref counting with a segment-count state machine that only
needs atomics on the cold deallocation path.

Result:

No more per-allocation atomic operations for SizeClassedChunk.
@franz1981 franz1981 force-pushed the 4.2_rm_rf_cnt_chunks branch from 6c181fb to dec518f Compare February 26, 2026 08:58
@franz1981

Copy link
Copy Markdown
Contributor Author

PTAL @normanmaurer

@franz1981

Copy link
Copy Markdown
Contributor Author

@chrisvest done bud ! Let's see if the CI is happy ;)
Right after this @laosijikaichele if you are interested we can take a look to fix the chunk reuse q, which is not good ATM (the offer/poll storm is not good at all), maybe with @chrisvest ? wdyt?

@franz1981

Copy link
Copy Markdown
Contributor Author

CI is green @normanmaurer @chrisvest so, ready to go?

@chrisvest chrisvest merged commit de25e7a into netty:4.2 Feb 26, 2026
19 checks passed
@chrisvest

Copy link
Copy Markdown
Member

Very nice. Thanks!

@netty-project-bot

Copy link
Copy Markdown
Contributor

Could not create auto-port PR.
Got conflicts when cherry-picking onto 4.1.

netty-project-bot pushed a commit that referenced this pull request Feb 26, 2026
Motivation:

SizeClassedChunk performs 2 atomic ops (retain/release) per allocation
cycle on the hot path.

Modification:

Replace ref counting with a segment-count state machine that only needs
atomics on the cold deallocation path.

Result:

No more per-allocation atomic operations for SizeClassedChunk.

(cherry picked from commit de25e7a)
@netty-project-bot

Copy link
Copy Markdown
Contributor

Auto-port PR for 5.0: #16378

@github-actions github-actions Bot removed the needs-cherry-pick-5.0 This PR should be cherry-picked to 5.0 once merged. label Feb 26, 2026
chrisvest pushed a commit to chrisvest/netty that referenced this pull request Feb 26, 2026
Motivation:

SizeClassedChunk performs 2 atomic ops (retain/release) per allocation
cycle on the hot path.

Modification:

Replace ref counting with a segment-count state machine that only needs
atomics on the cold deallocation path.

Result:

No more per-allocation atomic operations for SizeClassedChunk.

(cherry picked from commit de25e7a)
@chrisvest

Copy link
Copy Markdown
Member

Port PR for 4.1: #16379

@chrisvest chrisvest removed the needs-cherry-pick-4.1 This PR should be cherry-picked to 4.1 once merged. label Feb 26, 2026
chrisvest added a commit that referenced this pull request Feb 26, 2026
Motivation:

SizeClassedChunk performs 2 atomic ops (retain/release) per allocation
cycle on the hot path.

Modification:

Replace ref counting with a segment-count state machine that only needs
atomics on the cold deallocation path.

Result:

No more per-allocation atomic operations for SizeClassedChunk.

(cherry picked from commit de25e7a)

Co-authored-by: Francesco Nigro <nigro.fra@gmail.com>
@franz1981 franz1981 deleted the 4.2_rm_rf_cnt_chunks branch February 27, 2026 12:55
chrisvest pushed a commit that referenced this pull request Feb 27, 2026
…6378)

Auto-port of #16306 to 5.0
Cherry-picked commit: de25e7a

---
Motivation:

SizeClassedChunk performs 2 atomic ops (retain/release) per allocation
cycle on the hot path.

Modification:

Replace ref counting with a segment-count state machine that only needs
atomics on the cold deallocation path.

Result:

No more per-allocation atomic operations for SizeClassedChunk.

Co-authored-by: Francesco Nigro <nigro.fra@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants