Simplify reference counting#15764
Conversation
Motivation: Our current reference counting algorithm is quite complicated, trying to guard for word-tearing of 32-bit integers, which isn't possible in Java. The complicated code compiles to larger amounts of native code, which in turn can prevent inlining as it comes up against code size heuristics in the JIT. Modification: Simplify the code by removing special case checks for certain constants. Reorganize the code to use fewer branches Result: Simpler code that's more JIT friendly, without loss of performance.
|
@franz1981 Please try benchmarking these changes. On my M1, I get performance in If I change |
|
But to be fair, it matters? |
|
You can try applying this on top of my adaptive pull and see how to perform since I didn't get rid of the chunk ref count for size classed because I was waiting this ❤️ |
|
It wasn't faster without the parity bit on my machine. |
|
I believe it - I will try on my x86 as well to check how it behaves. I have to check to if it solves the problem in the issue related set bytes too |
|
Also, I suspect the contended case is relevant for the chunks. |
|
Uh, you are right. It is indeed (for shared magazines). |
|
Huh, interesting. I'm actually getting pretty reliable JIT compiler crash on the |
|
Looks like I can make the compiler crash go away by collapsing the implementations down as well, so we only have AbstractReferenceCounted and no longer need the impls in Chunk and AbstrsctReferenceCountedByteBuf. And a few other small tweaks. |
4f58b3e to
c6afb03
Compare
|
To work-around the JIT bug, I've had to increase the scope of this PR to also collapse the three different |
…ReferenceCounted And also work-around what appears to be a C2 JIT compiler bug in Java 17 through 25+.
c6afb03 to
f410cc2
Compare
| } | ||
| } | ||
|
|
||
| private static class Chunk extends AbstractReferenceCounted implements ChunkInfo { |
There was a problem hiding this comment.
Consider that w 8953bbe I plan to make Chunk's reference count to exist only for the BumpChunk (bad name I know :"()
There was a problem hiding this comment.
Yeah. And this is also breaking the graal build as-is… need to think about what to do here.
|
The idea of collapsing the hierarchy is very similar to what @yawkat has done on franz1981#1 and will likely avoid the problem of bimoprhic inlining + fat compiled methods observed in #15736 (comment). |
|
I have cherry picked both your changes on top of #15741 getting whilst my last commit was delivering which shows a slightly regressed performance in the non-"polluted" case With the change at #15764 (comment) the performance is (nearly) restored: Adding #15764 (comment) too (to the previous one) completely restore the performance: I have the suspect we can get better on |
|
@franz1981 I applied the optimizations you suggested. They don't seem to make any noticeable difference on M1, but if they help x86 we'll of course take that win. |
78ccf85 to
b077572
Compare
e5c3b63 to
4c09131
Compare
|
@franz1981 @normanmaurer Please take a look. |
franz1981
left a comment
There was a problem hiding this comment.
In term of performance it will have some impact for the most common case (as explained in the description I already gave), but in a follow up PR it enables us to likely re-enable VarHandle for native image as well.
I will run few benchmarks, likely early next week; in case, we still have the option B which would remove any chance for regression - with @normanmaurer blessing
|
@chrisvest @franz1981 let me know once I should review |
|
I think @normanmaurer this is ready to go, apart from the small comments I sent. I still believe this is bringing unwanted performance effects which the option 2. B would save, as I have explained in another comment for you 🙏 |
|
@normanmaurer @franz1981 Last few comments addressed. |
|
|
||
| static final int EMPTY_BYTE_BUF_HASH_CODE = 1; | ||
| private static final ByteBuffer EMPTY_BYTE_BUFFER = PlatformDependent.allocateDirect(0).buffer(); | ||
| private static final ByteBuffer EMPTY_BYTE_BUFFER = ByteBuffer.allocateDirect(0); |
There was a problem hiding this comment.
On Graal 25, PlatformDependent.allocateDirect will use our MemorySegment based cleaner implementation. Turns out Graal will refuse to put native memory segments on the image heap because they contain native pointers which when initialized at build time may not be valid at runtime, even though in this case the segment is zero bytes in length and it would almost certainly be fine.
Since the EMPTY_BYTE_BUFFER instance is never deallocated and doesn't contribute to native memory usage (or only contributes one pointer size of bytes), I thought making this change would be a fair trade off.
| * compared to when {@link ReferenceCountUpdater} is used. | ||
| */ | ||
| @SuppressWarnings("deprecation") | ||
| public class RefCnt { |
There was a problem hiding this comment.
hmmm after thinking a bit more about it wouldn't that basically break again on older android versions:
There was a problem hiding this comment.
@chrisvest I think the problem exists here as well as we import the VarHandle.
There was a problem hiding this comment.
I think it's fine because all VarHandle uses happen in VarHandleRefCnt which is a separate class that is never loaded when VarHandle is not supported
There was a problem hiding this comment.
ok if that works I am happy :)
This reverts commit a8e370d.
Motivation:
Our current reference counting algorithm is quite complicated, and the compiled native code size puts pressure on the JIT inliner, which in turn hurts performance.
Modification:
Simplify the code by removing special case checks for certain constants.
Reorganize the code to use fewer branches.
Remove the different uses of
ReferenceCountUpdaterand introduce aRefCnt.Avoid inheritance and use composition instead, so that we have a single implementation at runtime which helps both the C2 JIT compiler and Graal native-image.
Result:
Simpler code that's more JIT friendly, without loss of performance.