Simplify reference counting by chrisvest · Pull Request #15764 · netty/netty

chrisvest · 2025-10-16T18:16:09Z

Motivation:
Our current reference counting algorithm is quite complicated, and the compiled native code size puts pressure on the JIT inliner, which in turn hurts performance.

Modification:
Simplify the code by removing special case checks for certain constants.
Reorganize the code to use fewer branches.
Remove the different uses of ReferenceCountUpdater and introduce a RefCnt.
Avoid inheritance and use composition instead, so that we have a single implementation at runtime which helps both the C2 JIT compiler and Graal native-image.

Result:
Simpler code that's more JIT friendly, without loss of performance.

Motivation:
Our current reference counting algorithm is quite complicated, trying to guard against word-tearing of 32-bit integers, which isn't possible in Java. The complicated code compiles to larger amounts of native code, which in turn can prevent inlining as it comes up against code size heuristics in the JIT.

Modification:
Simplify the code by removing special case checks for certain constants.
Reorganize the code to use fewer branches.

Result:
Simpler code that's more JIT friendly, without loss of performance.

chrisvest · 2025-10-16T18:19:03Z

@franz1981 Please try benchmarking these changes. On my M1, I get performance in AbstractReferenceCountedByteBufBenchmark that is on par with the current code.

If I change retain to use a CAS loop, its performance drops dramatically as soon as there's contention. I kept the even/odd reference counting scheme because of this, because it allows us to use the getAndAdd intrinsics, which scale much better.
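For readers unfamiliar with the even/odd scheme mentioned above, here is a minimal sketch of the idea. It is not the code in this PR: the class name `EvenOddRefCntSketch` and its methods are hypothetical, and the overflow handling is simplified. It assumes the raw value stores the real count shifted left by one, with any odd value meaning "already released". That lets `retain` be a single unconditional `getAndAdd`, because adding 2 to an odd value keeps it odd, so a racing `retain` can never make a released object look live.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Hypothetical sketch of an even/odd reference count; not Netty's actual RefCnt.
final class EvenOddRefCntSketch {
    private static final AtomicIntegerFieldUpdater<EvenOddRefCntSketch> UPDATER =
            AtomicIntegerFieldUpdater.newUpdater(EvenOddRefCntSketch.class, "rawCnt");

    // Raw encoding (an assumption of this sketch): real count << 1.
    // Any odd raw value means "already released".
    private volatile int rawCnt = 2; // starts with one live reference

    int refCnt() {
        int raw = rawCnt;
        return (raw & 1) != 0 ? 0 : raw >>> 1;
    }

    void retain() {
        // A single unconditional atomic add; no retry loop under contention.
        int old = UPDATER.getAndAdd(this, 2);
        if ((old & 1) != 0 || old + 2 < 0) {
            // Released (odd) or overflowed: undo the add and fail.
            UPDATER.getAndAdd(this, -2);
            throw new IllegalStateException("cannot retain, raw count was " + old);
        }
    }

    boolean release() {
        for (;;) {
            int raw = rawCnt;
            if ((raw & 1) != 0) {
                throw new IllegalStateException("already released");
            }
            if (raw == 2) {
                // Last reference: flip to an odd value to mark the object released.
                if (UPDATER.compareAndSet(this, 2, 1)) {
                    return true;
                }
            } else if (UPDATER.compareAndSet(this, raw, raw - 2)) {
                return false;
            }
        }
    }

    public static void main(String[] args) {
        EvenOddRefCntSketch cnt = new EvenOddRefCntSketch();
        cnt.retain();
        System.out.println(cnt.refCnt());  // 2
        System.out.println(cnt.release()); // false
        System.out.println(cnt.release()); // true
        System.out.println(cnt.refCnt());  // 0
    }
}
```

In this sketch only `release` needs a compare-and-set loop, since it is the operation that must detect the transition to zero exactly once.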

franz1981 · 2025-10-16T18:21:54Z

But to be fair, does it matter?
I mean, I would optimize for the normal uncontended use cases if I have to pick a poison. This will improve the uncontended cases because the comparisons don't need to strip the parity bit and become (if predictable) super cheap.
If we want to go really fast (under contention), a double sequence scheme is the way to go, because the two sequences keep on increasing and can use increment-and-get separately.

franz1981 · 2025-10-16T18:25:10Z

You can try applying this on top of my adaptive pull request and see how it performs, since I didn't get rid of the chunk ref count for size classed because I was waiting for this ❤️

chrisvest · 2025-10-16T18:25:18Z

It wasn't faster without the parity bit on my machine.

franz1981 · 2025-10-16T18:27:13Z

I believe it - I will try on my x86 as well to check how it behaves. I also have to check whether it solves the problem in the issue related to set bytes.

chrisvest · 2025-10-16T18:30:30Z

Also, I suspect the contended case is relevant for the chunks.

franz1981 · 2025-10-16T18:49:31Z

Uh, you are right. It is indeed (for shared magazines).
Although I have another solution there to get rid of it

chrisvest · 2025-10-16T19:16:20Z

Huh, interesting. I'm actually getting a pretty reliable JIT compiler crash on the ReferenceCountUpdater::release0 method!

chrisvest · 2025-10-16T21:01:55Z

Looks like I can make the compiler crash go away by collapsing the implementations down as well, so we only have AbstractReferenceCounted and no longer need the impls in Chunk and AbstractReferenceCountedByteBuf. And a few other small tweaks.

chrisvest · 2025-10-16T22:16:07Z

To work around the JIT bug, I've had to increase the scope of this PR to also collapse the three different ReferenceCounted implementations into one: AbstractReferenceCounted. This means that Chunk and AbstractReferenceCountedByteBuf no longer integrate with ReferenceCountUpdater directly, but instead all pull their implementations from AbstractReferenceCounted.

franz1981 · 2025-10-17T01:55:04Z

-            }
-        }
-
+    private static class Chunk extends AbstractReferenceCounted implements ChunkInfo {


Consider that with 8953bbe I plan to make Chunk's reference count exist only for the BumpChunk (bad name, I know :"()

Yeah. And this is also breaking the graal build as-is… need to think about what to do here.

Check franz1981#1

franz1981 · 2025-10-17T02:27:19Z

The idea of collapsing the hierarchy is very similar to what @yawkat has done on franz1981#1 and will likely avoid the problem of bimorphic inlining + fat compiled methods observed in #15736 (comment).
But the only way to know is to try this change on #15736 as well.

franz1981 · 2025-10-17T03:15:38Z

I have cherry picked both your changes on top of #15741 getting

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt  100  29.725 ± 0.261  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt  100  32.376 ± 0.453  ns/op

whilst my last commit was delivering

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt  100  27.484 ± 0.251  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt  100  32.874 ± 0.585  ns/op

which shows a slightly regressed performance in the non-"polluted" case

With the change at #15764 (comment) the performance is (nearly) restored:

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt  100  28.091 ± 0.431  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt  100  32.044 ± 0.457  ns/op

Adding #15764 (comment) too (on top of the previous one) completely restores the performance:

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt  100  27.636 ± 0.142  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt  100  31.993 ± 0.531  ns/op

I suspect we can do better on retain0 as well, since we can optimize for the not-taken branch there too (i.e. no thrown exception) - I will post a suggestion for that one too.
And that should help the chunk and retained slices use cases (e.g. for HTTP/2).
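As an illustration of what "optimizing for the not-taken branch" can look like in general, the sketch below keeps the hot method down to one atomic add and one rarely taken branch, and moves the rollback and exception construction into a separate method. This is a common JIT-friendliness pattern and only an assumption about the shape of the suggestion; `ColdPathSketch`, `retainFailed` and the helper methods are hypothetical names, not code from this PR.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: keep the failure path out of the hot method body.
final class ColdPathSketch {
    // Same assumed encoding as the even/odd scheme: odd means released.
    private final AtomicInteger rawCnt = new AtomicInteger(2);

    void retain() {
        int old = rawCnt.getAndAdd(2);
        if ((old & 1) != 0) {
            // Expected to be rare, so all the work lives in a separate method
            // and the hot path stays small.
            throw retainFailed(old);
        }
    }

    // Cold path: undo the increment and build the exception.
    private IllegalStateException retainFailed(int old) {
        rawCnt.getAndAdd(-2);
        return new IllegalStateException("retain on released object, raw count " + old);
    }

    // Helpers for demonstration only.
    void markReleased() {
        rawCnt.set(1);
    }

    int raw() {
        return rawCnt.get();
    }
}
```

Whether this actually changes the generated code depends on the JIT and its inlining heuristics, so it has to be confirmed with benchmarks like the ones above.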

chrisvest · 2025-10-17T15:55:38Z

@franz1981 I applied the optimizations you suggested. They don't seem to make any noticeable difference on M1, but if they help x86 we'll of course take that win.

chrisvest · 2025-10-30T22:58:55Z

@franz1981 @normanmaurer Please take a look.

franz1981

In terms of performance it will have some impact for the most common case (as explained in the description I already gave), but in a follow-up PR it likely enables us to re-enable VarHandle for native image as well.

I will run a few benchmarks, likely early next week; in case of a regression, we still have option B, which would remove any chance of one - with @normanmaurer's blessing

normanmaurer · 2025-11-03T09:40:22Z

@chrisvest @franz1981 let me know once I should review

franz1981 · 2025-11-03T09:47:49Z

@normanmaurer I think this is ready to go, apart from the small comments I sent. I still believe this brings unwanted performance effects which option 2. B would avoid, as I have explained in another comment for you 🙏

chrisvest · 2025-11-04T00:06:07Z

@normanmaurer @franz1981 Last few comments addressed.

normanmaurer · 2025-11-04T10:10:59Z


    static final int EMPTY_BYTE_BUF_HASH_CODE = 1;
-    private static final ByteBuffer EMPTY_BYTE_BUFFER = PlatformDependent.allocateDirect(0).buffer();
+    private static final ByteBuffer EMPTY_BYTE_BUFFER = ByteBuffer.allocateDirect(0);


seems unrelated...

On Graal 25, PlatformDependent.allocateDirect will use our MemorySegment based cleaner implementation. Turns out Graal will refuse to put native memory segments on the image heap because they contain native pointers which when initialized at build time may not be valid at runtime, even though in this case the segment is zero bytes in length and it would almost certainly be fine.

Since the EMPTY_BYTE_BUFFER instance is never deallocated and doesn't contribute to native memory usage (or only contributes one pointer size of bytes), I thought making this change would be a fair trade off.
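To make the replacement concrete, this small sketch shows what the plain JDK call produces: a direct buffer with zero capacity. The class name `EmptyDirectBufferSketch` is hypothetical; only `ByteBuffer.allocateDirect` comes from the diff above.

```java
import java.nio.ByteBuffer;

// Hypothetical holder class; only the allocateDirect(0) call mirrors the diff.
final class EmptyDirectBufferSketch {
    static final ByteBuffer EMPTY = ByteBuffer.allocateDirect(0);

    public static void main(String[] args) {
        System.out.println(EMPTY.isDirect());     // true
        System.out.println(EMPTY.capacity());     // 0
        System.out.println(EMPTY.hasRemaining()); // false
    }
}
```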

normanmaurer · 2025-11-04T10:13:12Z

+ * compared to when {@link ReferenceCountUpdater} is used.
+ */
+@SuppressWarnings("deprecation")
+public class RefCnt {


hmmm after thinking a bit more about it wouldn't that basically break again on older android versions:

#15654

Looking at how #15661 solved it, we may be in the clear with our switch guards.
/cc @yawkat

@chrisvest I think the problem exists here as well as we import the VarHandle.

I think it's fine because all VarHandle uses happen in VarHandleRefCnt which is a separate class that is never loaded when VarHandle is not supported

ok if that works I am happy :)
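The isolation idea discussed in this thread can be sketched as follows. This is not the actual RefCnt or VarHandleRefCnt code: `CounterSketch`, `Impl`, `Holder` and `varHandleAvailable` are hypothetical names, and the availability check is a stand-in for whatever guard Netty really uses. The point is that `VarHandle` appears only inside one nested class, which the JVM does not initialize unless the guard selects it. Whether that is sufficient on older Android runtimes is exactly the question raised above, so treat it as an assumption.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Hypothetical sketch of isolating VarHandle usage in a separate class.
final class CounterSketch {
    static final class Holder {
        volatile int value;
    }

    interface Impl {
        int getAndAdd(Holder h, int delta);
    }

    // The only class that touches VarHandle in fields or method bodies.
    static final class VarHandleImpl implements Impl {
        private static final VarHandle VALUE;

        static {
            try {
                VALUE = MethodHandles.lookup().findVarHandle(Holder.class, "value", int.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        @Override
        public int getAndAdd(Holder h, int delta) {
            return (int) VALUE.getAndAdd(h, delta);
        }
    }

    // Fallback that needs nothing newer than Java 5.
    static final class UpdaterImpl implements Impl {
        private static final AtomicIntegerFieldUpdater<Holder> VALUE =
                AtomicIntegerFieldUpdater.newUpdater(Holder.class, "value");

        @Override
        public int getAndAdd(Holder h, int delta) {
            return VALUE.getAndAdd(h, delta);
        }
    }

    // Stand-in capability check (an assumption of this sketch).
    private static boolean varHandleAvailable() {
        try {
            Class.forName("java.lang.invoke.VarHandle");
            return true;
        } catch (Throwable t) {
            return false;
        }
    }

    static final Impl IMPL = varHandleAvailable() ? new VarHandleImpl() : new UpdaterImpl();
}
```

Callers go through `CounterSketch.IMPL` only, so the choice of implementation is made once, when the outer class is initialized.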

chrisvest requested a review from franz1981 October 16, 2025 18:16

chrisvest requested a review from normanmaurer October 16, 2025 18:20

chrisvest force-pushed the 4.2-refcount branch from 4f58b3e to c6afb03 Compare October 16, 2025 21:33

Collapse the ReferenceCounted implementations to all rely on Abstract…

f410cc2

…ReferenceCounted And also work-around what appears to be a C2 JIT compiler bug in Java 17 through 25+.

chrisvest force-pushed the 4.2-refcount branch from c6afb03 to f410cc2 Compare October 16, 2025 22:55

franz1981 reviewed Oct 17, 2025

Comment thread common/src/main/java/io/netty/util/internal/ReferenceCountUpdater.java

Comment thread common/src/main/java/io/netty/util/internal/ReferenceCountUpdater.java

franz1981 requested changes Oct 17, 2025

Comment thread common/src/main/java/io/netty/util/internal/ReferenceCountUpdater.java

franz1981 reviewed Oct 17, 2025

Comment thread common/src/main/java/io/netty/util/internal/ReferenceCountUpdater.java

franz1981 requested changes Oct 17, 2025

Comment thread common/src/main/java/io/netty/util/internal/ReferenceCountUpdater.java

Apply some micro-opts that help x86 a bit

45633cd

chrisvest added 2 commits October 17, 2025 16:31

Merge branch '4.2' into 4.2-refcount

7952927

Merge branch '4.2' into 4.2-refcount

d72d545

franz1981 reviewed Oct 21, 2025

Comment thread common/src/main/java/io/netty/util/internal/ReferenceCountUpdater.java

Fix the native image build

b077572

chrisvest force-pushed the 4.2-refcount branch from 78ccf85 to b077572 Compare October 23, 2025 19:05

chrisvest added 2 commits October 30, 2025 15:49

Fix revapi warning

9b726f9

Deprecate ReferenceCountUpdater

4c09131

chrisvest force-pushed the 4.2-refcount branch from e5c3b63 to 4c09131 Compare October 30, 2025 22:49

chrisvest requested a review from franz1981 October 30, 2025 22:51

franz1981 requested changes Oct 31, 2025

Address review comments

e0ba069

franz1981 reviewed Nov 2, 2025

Comment thread buffer/src/main/java/io/netty/buffer/AbstractReferenceCountedByteBuf.java Outdated

franz1981 requested changes Nov 2, 2025

Comment thread buffer/src/main/java/io/netty/buffer/AbstractReferenceCountedByteBuf.java Outdated

Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated

Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated

normanmaurer approved these changes Nov 3, 2025

Comment thread buffer/src/main/java/io/netty/buffer/Unpooled.java Outdated

Comment thread common/src/main/java/io/netty/util/internal/RefCnt.java

chrisvest added 2 commits November 3, 2025 15:00

Merge branch '4.2' into 4.2-refcount

be180ee

Address review comments

9f0571f

normanmaurer reviewed Nov 4, 2025

normanmaurer requested changes Nov 4, 2025

franz1981 requested changes Nov 6, 2025

Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java

Make RefCnt final

3f42263

normanmaurer approved these changes Nov 6, 2025

chrisvest merged commit a8e370d into netty:4.2 Nov 6, 2025
19 checks passed

chrisvest deleted the 4.2-refcount branch November 6, 2025 20:39

This was referenced Nov 7, 2025

native-image should benefit of VarHandle #15827

Open

Make AdaptiveByteBuf.setBytes faster #15736

Merged

Improve adaptive allocator thread local performance #15741

Merged

chrisvest added a commit to chrisvest/netty that referenced this pull request Nov 7, 2025

Revert "Simplify reference counting (netty#15764)"

c85740a

This reverts commit a8e370d.

chrisvest mentioned this pull request Nov 7, 2025

Revert "Simplify reference counting (#15764)" #15831

Closed

chrisvest mentioned this pull request Nov 18, 2025

(rawCnt == 2 || rawCnt == 4 || (rawCnt & 1) == 0 equivalent to rawCnt&1 ==0 #15269

Closed

dreamlike-ocean mentioned this pull request Mar 4, 2026

Backport request: workaround for JDK-8374463 (Netty PR #15764) #16414

Closed

Songdoeon mentioned this pull request Mar 30, 2026

Enable VarHandle in native-image environments #16573

Open
