Skip to content

Simplify reference counting#15764

Merged
chrisvest merged 20 commits into
netty:4.2from
chrisvest:4.2-refcount
Nov 6, 2025
Merged

Simplify reference counting#15764
chrisvest merged 20 commits into
netty:4.2from
chrisvest:4.2-refcount

Conversation

@chrisvest

@chrisvest chrisvest commented Oct 16, 2025

Copy link
Copy Markdown
Member

Motivation:
Our current reference counting algorithm is quite complicated, and the compiled native code size puts pressure on the JIT inliner, which in turn hurts performance.

Modification:
Simplify the code by removing special case checks for certain constants.
Reorganize the code to use fewer branches.
Remove the different uses of ReferenceCountUpdater and introduce a RefCnt.
Avoid inheritance and use composition instead, so that we have a single implementation at runtime which helps both the C2 JIT compiler and Graal native-image.

Result:
Simpler code that's more JIT friendly, without loss of performance.

Motivation:
Our current reference counting algorithm is quite complicated, trying to guard for word-tearing of 32-bit integers, which isn't possible in Java.

The complicated code compiles to larger amounts of native code, which in turn can prevent inlining as it comes up against code size heuristics in the JIT.

Modification:
Simplify the code by removing special case checks for certain constants.
Reorganize the code to use fewer branches

Result:
Simpler code that's more JIT friendly, without loss of performance.
@chrisvest chrisvest requested a review from franz1981 October 16, 2025 18:16
@chrisvest

Copy link
Copy Markdown
Member Author

@franz1981 Please try benchmarking these changes. On my M1, I get performance in AbstractReferenceCountedByteBufBenchmark that is on par with the current code.

If I change retain to use a CAS loop, its performance drops dramatically as soon as there's contention. I kept the even/odd reference counting scheme because of this, because it allows us to use the getAndAdd intrinsics, which scale much better.

@franz1981

franz1981 commented Oct 16, 2025

Copy link
Copy Markdown
Contributor

But to be fair, it matters?
I mean, I would optimize for normal uncontended use cases if I have to pick a poison. This will improve the unconteded cases because the comparisons doesn't need to strip the parity bit and become (if predictable) super cheap.
If we want to go really fast (under contention) a double sequence scheme is the way because the two sequences keep on increasing and can use increment and get separately

@franz1981

Copy link
Copy Markdown
Contributor

You can try applying this on top of my adaptive pull and see how to perform since I didn't get rid of the chunk ref count for size classed because I was waiting this ❤️

@chrisvest

Copy link
Copy Markdown
Member Author

It wasn't faster without the parity bit on my machine.

@franz1981

Copy link
Copy Markdown
Contributor

I believe it - I will try on my x86 as well to check how it behaves. I have to check to if it solves the problem in the issue related set bytes too

@chrisvest

Copy link
Copy Markdown
Member Author

Also, I suspect the contended case is relevant for the chunks.

@franz1981

franz1981 commented Oct 16, 2025

Copy link
Copy Markdown
Contributor

Uh, you are right. It is indeed (for shared magazines).
Although I have another solution there to get rid of it

@chrisvest

Copy link
Copy Markdown
Member Author

Huh, interesting. I'm actually getting pretty reliable JIT compiler crash on the ReferenceCountUpdater::release0 method!

@chrisvest

Copy link
Copy Markdown
Member Author

Looks like I can make the compiler crash go away by collapsing the implementations down as well, so we only have AbstractReferenceCounted and no longer need the impls in Chunk and AbstrsctReferenceCountedByteBuf. And a few other small tweaks.

@chrisvest

Copy link
Copy Markdown
Member Author

To work-around the JIT bug, I've had to increase the scope of this PR to also collapse the three different ReferenceCounted implementations into one; AbstractReferenceCounted. This means that Chunk and AbstractReferenceCountedByteBuf no longer integrate with ReferenceCountUpdater directly, but instead all pull their implementations from AbstractReferenceCounted.

…ReferenceCounted

And also work-around what appears to be a C2 JIT compiler bug in Java 17 through 25+.
}
}

private static class Chunk extends AbstractReferenceCounted implements ChunkInfo {

@franz1981 franz1981 Oct 17, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider that w 8953bbe I plan to make Chunk's reference count to exist only for the BumpChunk (bad name I know :"()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. And this is also breaking the graal build as-is… need to think about what to do here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check franz1981#1

@franz1981

Copy link
Copy Markdown
Contributor

The idea of collapsing the hierarchy is very similar to what @yawkat has done on franz1981#1 and will likely avoid the problem of bimoprhic inlining + fat compiled methods observed in #15736 (comment).
But the only way to know it is try this change on #15736 as well.

@franz1981

franz1981 commented Oct 17, 2025

Copy link
Copy Markdown
Contributor

I have cherry picked both your changes on top of #15741 getting

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt  100  29.725 ± 0.261  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt  100  32.376 ± 0.453  ns/op

whilst my last commit was delivering

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt  100  27.484 ± 0.251  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt  100  32.874 ± 0.585  ns/op

which shows a slightly regressed performance in the non-"polluted" case

With the change at #15764 (comment) the performance is (nearly) restored:

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt  100  28.091 ± 0.431  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt  100  32.044 ± 0.457  ns/op

Adding #15764 (comment) too (to the previous one) completely restore the performance:

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt  100  27.636 ± 0.142  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt  100  31.993 ± 0.531  ns/op

I have the suspect we can get better on retain0 as well since we can assume to optimize for the not taken branch there too (i.e. no thrown exception) - I will post a suggestion for that one too.
And that should help the chunk and (retained slices) use cases (e.g. for HTTP 2)

@chrisvest

Copy link
Copy Markdown
Member Author

@franz1981 I applied the optimizations you suggested. They don't seem to make any noticeable difference on M1, but if they help x86 we'll of course take that win.

@chrisvest

Copy link
Copy Markdown
Member Author

@franz1981 @normanmaurer Please take a look.

@franz1981 franz1981 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In term of performance it will have some impact for the most common case (as explained in the description I already gave), but in a follow up PR it enables us to likely re-enable VarHandle for native image as well.

I will run few benchmarks, likely early next week; in case, we still have the option B which would remove any chance for regression - with @normanmaurer blessing

Comment thread common/src/main/java/io/netty/util/internal/RefCnt.java Outdated
Comment thread common/src/main/java/io/netty/util/internal/RefCnt.java
Comment thread common/src/main/java/io/netty/util/internal/RefCnt.java Outdated
Comment thread common/src/main/java/io/netty/util/internal/RefCnt.java Outdated
Comment thread common/src/main/java/io/netty/util/internal/RefCnt.java
Comment thread common/src/main/java/io/netty/util/internal/RefCnt.java
Comment thread buffer/src/main/java/io/netty/buffer/AbstractReferenceCountedByteBuf.java Outdated
Comment thread buffer/src/main/java/io/netty/buffer/AbstractReferenceCountedByteBuf.java Outdated
Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java Outdated
@normanmaurer

Copy link
Copy Markdown
Member

@chrisvest @franz1981 let me know once I should review

@franz1981

Copy link
Copy Markdown
Contributor

I think @normanmaurer this is ready to go, apart from the small comments I sent. I still believe this is bringing unwanted performance effects which the option 2. B would save, as I have explained in another comment for you 🙏

Comment thread buffer/src/main/java/io/netty/buffer/Unpooled.java Outdated
Comment thread common/src/main/java/io/netty/util/internal/RefCnt.java
@chrisvest

Copy link
Copy Markdown
Member Author

@normanmaurer @franz1981 Last few comments addressed.


static final int EMPTY_BYTE_BUF_HASH_CODE = 1;
private static final ByteBuffer EMPTY_BYTE_BUFFER = PlatformDependent.allocateDirect(0).buffer();
private static final ByteBuffer EMPTY_BYTE_BUFFER = ByteBuffer.allocateDirect(0);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems unrelated...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Graal 25, PlatformDependent.allocateDirect will use our MemorySegment based cleaner implementation. Turns out Graal will refuse to put native memory segments on the image heap because they contain native pointers which when initialized at build time may not be valid at runtime, even though in this case the segment is zero bytes in length and it would almost certainly be fine.

Since the EMPTY_BYTE_BUFFER instance is never deallocated and doesn't contribute to native memory usage (or only contributes one pointer size of bytes), I thought making this change would be a fair trade off.

* compared to when {@link ReferenceCountUpdater} is used.
*/
@SuppressWarnings("deprecation")
public class RefCnt {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm after thinking a bit more about it wouldn't that basically break again on older android versions:

#15654

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at how #15661 solved it, we may be in the clear with our switch guards.
/cc @yawkat

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chrisvest I think the problem exists here as well as we import the VarHandle.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine because all VarHandle uses happen in VarHandleRefCnt which is a separate class that is never loaded when VarHandle is not supported

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok if that works I am happy :)

Comment thread common/src/main/java/io/netty/util/internal/RefCnt.java Outdated
Comment thread buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants