Remove reference counting from size classed chunks by franz1981 · Pull Request #16306 · netty/netty

franz1981 · 2026-02-19T08:33:03Z

Motivation:

SizeClassedChunk performs 2 atomic ops (retain/release) per allocation cycle on the hot path.

Modification:

Replace ref counting with a segment-count state machine that only needs atomics on the cold deallocation path.

Result:

No more per-allocation atomic operations for SizeClassedChunk.
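To make the idea concrete, here is a minimal sketch of what a segment-count state machine could look like. It is not the code from this PR: the class name `SegmentedChunkSketch`, the three states, and the use of `ConcurrentLinkedQueue` as a stand-in for the MPSC queue are all assumptions made for illustration. The point it tries to show is that allocate/release only touch the free-segment queue, while the compare-and-set on the chunk state is reached only once the chunk has been marked for deallocation.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; not Netty's actual SizeClassedChunk. */
final class SegmentedChunkSketch {
    private static final int AVAILABLE = 0;   // chunk still serves allocations
    private static final int MARKED = 1;      // owner is done, waiting for segments to come back
    private static final int DEALLOCATED = 2; // backing memory has been freed

    private final int segmentCount;
    // Stand-in for the MPSC queue of free segments (assumption: any concurrent queue works here).
    private final ConcurrentLinkedQueue<Integer> freeSegments = new ConcurrentLinkedQueue<>();
    private final AtomicInteger state = new AtomicInteger(AVAILABLE);

    SegmentedChunkSketch(int segmentCount) {
        this.segmentCount = segmentCount;
        for (int i = 0; i < segmentCount; i++) {
            freeSegments.offer(i);
        }
    }

    /** Hot path: hands out a free segment index, or -1 if none is available. No retain(). */
    int allocateSegment() {
        if (state.get() != AVAILABLE) {
            return -1;
        }
        Integer segment = freeSegments.poll();
        return segment == null ? -1 : segment;
    }

    /** Hot path: gives the segment back. The state check is a volatile read, not a read-modify-write. */
    void releaseSegment(int segment) {
        freeSegments.offer(segment);
        if (state.get() == MARKED) {
            tryDeallocate(); // cold path, only after markToDeallocate()
        }
    }

    /** Cold path: the owner signals that it will not allocate from this chunk anymore. */
    void markToDeallocate() {
        if (state.compareAndSet(AVAILABLE, MARKED)) {
            tryDeallocate();
        }
    }

    private void tryDeallocate() {
        // All segments are back exactly when the free queue holds every segment again.
        if (freeSegments.size() == segmentCount && state.compareAndSet(MARKED, DEALLOCATED)) {
            // A real allocator would free the backing memory here.
        }
    }

    boolean isDeallocated() {
        return state.get() == DEALLOCATED;
    }
}
```

In this sketch, whichever call observes "marked and all segments returned" first wins the compare-and-set, so the memory is freed exactly once regardless of which thread releases the last segment.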

franz1981 · 2026-02-19T08:33:27Z

This is a change coming from #15741 @ 8953bbe

franz1981 · 2026-02-19T08:47:13Z

FYI @laosijikaichele this should deliver quite a decent boost 🗡️
in pretty much ALL cases, non thread locals and thread locals in particular ^^

@chrisvest sadly I have left the ref cnt to still be allocated per chunk, due to other optimizations, but it shouldn't be used

franz1981 · 2026-02-19T09:32:09Z

This is ready to go @normanmaurer. In terms of perf numbers, I have already run a lot of tests on this for other PRs, and the numbers were showing 1.5X or 2.X better perf for thread-local allocations

franz1981 · 2026-02-21T10:27:30Z

@laosijikaichele I will check using your perf test on mimalloc, but sadly we don't have the same HW, so I will likely get different numbers 😢
But last time I tried this, it sped things up a lot

laosijikaichele · 2026-02-21T10:49:09Z

@franz1981 great work, no worries about the hardware, we can mainly compare the number change on adaptive, before and after this PR, to see the improvement.

normanmaurer · 2026-02-23T14:15:10Z

@chrisvest @laosijikaichele @yawkat PTAL

franz1981 · 2026-02-23T15:07:05Z

PTAL @normanmaurer I have simplified the capacity method to make it inlineable, but TBH it's the logic there that concerns me, since mpsc::size is not "trivial" (it's a loop!)
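For readers wondering why a queue's size can be a loop: the sketch below shows the general technique used by index-based concurrent queues, where the size is derived from a producer index and a consumer index that cannot be read atomically together. The class and field names are hypothetical, and this is an illustration of the technique under that assumption, not the source of the queue used by Netty.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a size() computed from two independently updated indices. */
final class IndexedSizeSketch {
    final AtomicLong producerIndex = new AtomicLong();
    final AtomicLong consumerIndex = new AtomicLong();

    int size() {
        long after = consumerIndex.get();
        long producer;
        long before;
        do {
            // Re-read the consumer index around the producer read; retry if a poll raced in between,
            // otherwise the difference could be computed from an inconsistent pair.
            before = after;
            producer = producerIndex.get();
            after = consumerIndex.get();
        } while (before != after);
        long size = producer - after;
        return size > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) Math.max(size, 0L);
    }
}
```

Under contention the loop may spin for a few iterations, which is why calling it from a method that is meant to be small and inlineable is a concern.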

franz1981 · 2026-02-23T15:08:06Z

@laosijikaichele I was wrong: on my x86 it seems to improve by a mere ~7-8% atm, maybe you will have better luck with aarch64?
Now the most relevant cost is the single buffer release :"( (due to the new RefCnt indirection and VarHandle usage)

franz1981 · 2026-02-23T15:28:10Z

When I see things like this, my heart bleeds 😢

   0.10%    0x00007f1ddc2103cb:   jg     0x00007f1ddc21118c           ;*if_icmpgt {reexecute=0 rethrow=0 return_oop=0}
                                                                      ; - io.netty.buffer.AdaptivePoolingAllocator::allocate@7 (line 260)
            0x00007f1ddc2103d1:   lea    0x1f(%rdx),%r8d
   4.09%    0x00007f1ddc2103d5:   sar    $0x5,%r8d                    ;*ishr {reexecute=0 rethrow=0 return_oop=0}
                                                                      ; - io.netty.buffer.AdaptivePoolingAllocator::sizeIndexOf@5 (line 282)
                                                                      ; - io.netty.buffer.AdaptivePoolingAllocator::sizeClassIndexOf@1 (line 286)
                                                                      ; - io.netty.buffer.AdaptivePoolingAllocator::allocate@11 (line 261)

I know that is a data dep but, c'mon... on Intel the lea is faster...
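For context, the `lea 0x1f(%rdx)` / `sar $0x5` pair in the listing adds 31 to the requested size and then shifts right by 5, i.e. it rounds the size up to the next multiple of 32 and divides by 32. The shift cannot start until the add has produced its result, which is the data dependency mentioned above. A minimal Java sketch of that arithmetic, read off the disassembly rather than copied from the Netty source (the class name and constant are hypothetical):

```java
/** Hypothetical sketch of the arithmetic visible in the disassembly. */
final class SizeIndexSketch {
    private static final int SHIFT = 5; // 32-byte granularity, matching `sar $0x5`

    /** Rounds size up to a multiple of 32 and returns the number of 32-byte units. */
    static int sizeIndexOf(int size) {
        return (size + 31) >> SHIFT; // `lea 0x1f(%rdx)` followed by `sar $0x5`
    }
}
```

With this formula, sizes 1 through 32 map to index 1 and size 33 maps to index 2.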

franz1981 · 2026-02-23T16:03:11Z

@laosijikaichele I would be curious how the numbers now look vs mimalloc, to understand the gap left (assuming a properly sized chunk q 😢)

chrisvest

One nit but overall this looks good.

franz1981 · 2026-02-23T20:10:23Z

Results on my machine
without thread local accesses

  ┌────────────────────────┬──────────────────────┬──────────────────────┬───────┐
  │        Scenario        │  Before (77e2a683)   │   After (6975d4e7)   │ Delta │
  ├────────────────────────┼──────────────────────┼──────────────────────┼───────┤
  │ Default                │ 63.368 ± 1.487 ns/op │ 60.635 ± 0.900 ns/op │ -4.3% │
  ├────────────────────────┼──────────────────────┼──────────────────────┼───────┤
  │ --enable-native-access │ 62.687 ± 0.828 ns/op │ 59.861 ± 0.240 ns/op │ -4.5% │
  └────────────────────────┴──────────────────────┴──────────────────────┴───────┘

and thread local one

  ┌───────────────────────────────────────────┬──────────────────────┬──────────────────────┬────────┐
  │                 Scenario                  │  Before (77e2a683)   │   After (6975d4e7)   │ Delta  │
  ├───────────────────────────────────────────┼──────────────────────┼──────────────────────┼────────┤
  │ Custom executor                           │ 37.993 ± 0.762 ns/op │ 35.543 ± 0.936 ns/op │ -6.4%  │
  ├───────────────────────────────────────────┼──────────────────────┼──────────────────────┼────────┤
  │ Custom executor + native-access           │ 39.102 ± 0.503 ns/op │ 34.715 ± 0.704 ns/op │ -11.2% │
  └───────────────────────────────────────────┴──────────────────────┴──────────────────────┴────────┘

Not wow, but not terrible.
Let's say most of the improvements were already rolled out in the previous PR I made, #15741
It would be nice to see how much aarch64 benefits from removing the atomic ops in the hot path (two atomic ops per cycle)

franz1981 · 2026-02-23T20:14:16Z

And the other hidden performance benefit is that we now have scalable allocate/release from different threads - because it will scale as much as the underlying mpsc queue, while before it was "limited" by the shared "chunk" ref cnt. It's a pathological case, but still, worth mentioning.
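As a contrast, here is a minimal sketch of the reference-counted model this comment refers to. The class `RefCountedChunkSketch` is hypothetical and much simpler than the real chunk; it only illustrates that every allocation and every buffer release, from any thread, performs a read-modify-write on the same counter, which is the shared point of contention.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a chunk guarded by one shared reference count. */
final class RefCountedChunkSketch {
    // Starts at 1: the reference held by the chunk's owner.
    private final AtomicInteger refCnt = new AtomicInteger(1);
    private volatile boolean deallocated;

    /** Called once per buffer allocation: one atomic increment on the shared counter. */
    void retain() {
        refCnt.incrementAndGet();
    }

    /** Called once per buffer release and once by the owner: one atomic decrement each time. */
    void release() {
        if (refCnt.decrementAndGet() == 0) {
            deallocated = true; // a real allocator would free the backing memory here
        }
    }

    boolean isDeallocated() {
        return deallocated;
    }
}
```

Every allocate/release cycle therefore costs two atomic operations on `refCnt`, and threads releasing buffers of the same chunk all contend on that one field.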

franz1981 · 2026-02-23T21:04:24Z

PTAL @normanmaurer if you're happy with the first numbers at #16306 (comment)

Code was updated

chrisvest · 2026-02-25T01:28:37Z

@chrisvest in relation to your request to turn some release() calls into markToDeallocate(), I'll gently push back: these three release() calls are all refcount-related operations (undoing a prior retain()), not "I'm done with this chunk" signals. Changing them to markToDeallocate() would be semantically misleading IMHO 🙏

Fair enough 👍

chrisvest · 2026-02-25T18:56:57Z

Revert the io_uring changes. We can rebase on #16359 when it goes in.

The builds got cancelled after 6 hours and didn't produce any logs, which is unfortunate as we don't know if it's the allocator changes or the io_uring changes that made them get stuck.


franz1981 · 2026-02-26T08:58:35Z

PTAL @normanmaurer

franz1981 · 2026-02-26T10:11:44Z

@chrisvest done bud! Let's see if the CI is happy ;)
Right after this, @laosijikaichele, if you are interested we can take a look at fixing the chunk reuse q, which is not good ATM (the offer/poll storm is not good at all), maybe with @chrisvest? wdyt?

franz1981 · 2026-02-26T14:02:32Z

CI is green @normanmaurer @chrisvest so, ready to go?

chrisvest · 2026-02-26T18:01:09Z

Very nice. Thanks!

netty-project-bot · 2026-02-26T18:01:15Z

Could not create auto-port PR.
Got conflicts when cherry-picking onto 4.1.


netty-project-bot · 2026-02-26T18:01:17Z

Auto-port PR for 5.0: #16378


chrisvest · 2026-02-26T18:08:45Z

Port PR for 4.1: #16379

Motivation: SizeClassedChunk performs 2 atomic ops (retain/release) per allocation cycle on the hot path. Modification: Replace ref counting with a segment-count state machine that only needs atomics on the cold deallocation path. Result: No more per-allocation atomic operations for SizeClassedChunk. (cherry picked from commit de25e7a) Co-authored-by: Francesco Nigro <nigro.fra@gmail.com>


franz1981 requested review from chrisvest and normanmaurer February 19, 2026 09:30

franz1981 force-pushed the 4.2_rm_rf_cnt_chunks branch from 0eefff6 to 842abd9 Compare February 19, 2026 09:41

normanmaurer added this to the 4.2.11.Final milestone Feb 19, 2026

chrisvest added needs-cherry-pick-4.1 This PR should be cherry-picked to 4.1 once merged. needs-cherry-pick-5.0 This PR should be cherry-picked to 5.0 once merged. labels Feb 21, 2026

normanmaurer requested changes Feb 23, 2026


franz1981 force-pushed the 4.2_rm_rf_cnt_chunks branch from e5751d6 to 21f7420 Compare February 23, 2026 15:06

franz1981 force-pushed the 4.2_rm_rf_cnt_chunks branch 2 times, most recently from dfb6661 to 674ce6b Compare February 23, 2026 15:50

chrisvest previously approved these changes Feb 23, 2026


Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java

franz1981 force-pushed the 4.2_rm_rf_cnt_chunks branch from 674ce6b to 81de40e Compare February 23, 2026 20:10

normanmaurer requested changes Feb 23, 2026


Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated

franz1981 force-pushed the 4.2_rm_rf_cnt_chunks branch from d0afc55 to d921c93 Compare February 23, 2026 20:12

yawkat reviewed Feb 23, 2026


Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated

yawkat reviewed Feb 23, 2026


Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated

franz1981 added 9 commits February 26, 2026 09:57

Refactor remainingCapacity calculation and deallocation logic in AdaptivePoolingAllocator

cf9083d

Refactor state handling in AdaptivePoolingAllocator to simplify deallocation logic

4f63acf

Address Chris's comments

02559c3

Address Norman's comments

c040028

Address Jonas comments

6947b45

Address Jonas comments (2)

8a5cfd3

Refactor chunk queue to use SizeClassedChunk for improved type safety and capacity checks

4c81dcf

Fix logic in AdaptivePoolingAllocator

dec518f

franz1981 force-pushed the 4.2_rm_rf_cnt_chunks branch from 6c181fb to dec518f Compare February 26, 2026 08:58

chrisvest approved these changes Feb 26, 2026


chrisvest merged commit de25e7a into netty:4.2 Feb 26, 2026
19 checks passed

github-actions Bot removed the needs-cherry-pick-5.0 This PR should be cherry-picked to 5.0 once merged. label Feb 26, 2026

netty-project-bot mentioned this pull request Feb 26, 2026

Auto-port 5.0: Remove reference counting from size classed chunks #16378

Merged

chrisvest removed the needs-cherry-pick-4.1 This PR should be cherry-picked to 4.1 once merged. label Feb 26, 2026

franz1981 deleted the 4.2_rm_rf_cnt_chunks branch February 27, 2026 12:55

j-be mentioned this pull request Apr 9, 2026

Potential memory leak after upgrading from 4.2.10 to 4.2.11 #16606

Open
