Remove reference counting from size classed chunks#16306
Conversation
|
FYI @laosijikaichele this should deliver quite a decent boost 🗡️ @chrisvest sadly I have left the ref cnt to still be allocated per chunk, due to other optimizations, but it shouldn't be used |
|
This is ready to go @normanmaurer in term of perf numbers I have already performed tests a lot for other prs on this, and the numbers were showing 1.5X or 2.X better perf for the thread local allocations |
0eefff6 to
842abd9
Compare
|
@laosijikaichele I will check if using your perf test on mimalloc, but sadly we don't have the same HW so I will likely get some different numbers 😢 |
|
@franz1981 great work, no worries about the hardware, we can mainly compare the number change on |
e5751d6 to
21f7420
Compare
|
PTAL @normanmaurer I have simplified the capacity method to make it inlineable, but TBH it's the logic there which concern me, since mpsc::size is not "trivial" (is a loop!) |
|
@laosijikaichele I was wrong, it seems to improve, on my x86 by a mere ~7-8% atm, maybe you will have better luck with aarch64? |
|
When i see things like these my heart is bleeding 😢 0.10% 0x00007f1ddc2103cb: jg 0x00007f1ddc21118c ;*if_icmpgt {reexecute=0 rethrow=0 return_oop=0}
; - io.netty.buffer.AdaptivePoolingAllocator::allocate@7 (line 260)
0x00007f1ddc2103d1: lea 0x1f(%rdx),%r8d
4.09% 0x00007f1ddc2103d5: sar $0x5,%r8d ;*ishr {reexecute=0 rethrow=0 return_oop=0}
; - io.netty.buffer.AdaptivePoolingAllocator::sizeIndexOf@5 (line 282)
; - io.netty.buffer.AdaptivePoolingAllocator::sizeClassIndexOf@1 (line 286)
; - io.netty.buffer.AdaptivePoolingAllocator::allocate@11 (line 261)I know that is a data dep but, c'mon...on Intel this the lea is faster... |
dfb6661 to
674ce6b
Compare
|
@laosijikaichele i would be curious how the numbers now look vs mimalloc to understand the gap left (assuming proper sized chunk q 😢) |
chrisvest
left a comment
There was a problem hiding this comment.
One nit but overall this looks good.
674ce6b to
81de40e
Compare
|
Results on my machine and thread local one Not wow, but not terrible. |
d0afc55 to
d921c93
Compare
|
And the other hidden performance benefit is that we now have scalable allocate/release from different threads - because it will scale as much as the underlying mpsc queue, while before it was "limited" by the shared "chunk" ref cnt. It's a pathological case, but still, worth mentioning. |
|
PTAL @normanmaurer if you're happy with the first numbers at #16306 (comment) |
Fair enough 👍 |
|
Revert the io_uring changes. We can rebase on #16359 when it goes in. The builds got cancelled after 6 hours and didn't produce any logs, which is unfortunate as we don't know if it's the allocator changes or the io_uring changes that made them get stuck. |
Motivation: SizeClassedChunk performs 2 atomic ops (retain/release) per allocation cycle on the hot path. Modification: Replace ref counting with a segment-count state machine that only needs atomics on the cold deallocation path. Result: No more per-allocation atomic operations for SizeClassedChunk.
…tivePoolingAllocator
… and capacity checks
6c181fb to
dec518f
Compare
|
PTAL @normanmaurer |
|
@chrisvest done bud ! Let's see if the CI is happy ;) |
|
CI is green @normanmaurer @chrisvest so, ready to go? |
|
Very nice. Thanks! |
|
Could not create auto-port PR. |
Motivation: SizeClassedChunk performs 2 atomic ops (retain/release) per allocation cycle on the hot path. Modification: Replace ref counting with a segment-count state machine that only needs atomics on the cold deallocation path. Result: No more per-allocation atomic operations for SizeClassedChunk. (cherry picked from commit de25e7a)
|
Auto-port PR for 5.0: #16378 |
Motivation: SizeClassedChunk performs 2 atomic ops (retain/release) per allocation cycle on the hot path. Modification: Replace ref counting with a segment-count state machine that only needs atomics on the cold deallocation path. Result: No more per-allocation atomic operations for SizeClassedChunk. (cherry picked from commit de25e7a)
|
Port PR for 4.1: #16379 |
Motivation: SizeClassedChunk performs 2 atomic ops (retain/release) per allocation cycle on the hot path. Modification: Replace ref counting with a segment-count state machine that only needs atomics on the cold deallocation path. Result: No more per-allocation atomic operations for SizeClassedChunk. (cherry picked from commit de25e7a) Co-authored-by: Francesco Nigro <nigro.fra@gmail.com>
…6378) Auto-port of #16306 to 5.0 Cherry-picked commit: de25e7a --- Motivation: SizeClassedChunk performs 2 atomic ops (retain/release) per allocation cycle on the hot path. Modification: Replace ref counting with a segment-count state machine that only needs atomics on the cold deallocation path. Result: No more per-allocation atomic operations for SizeClassedChunk. Co-authored-by: Francesco Nigro <nigro.fra@gmail.com>
Motivation:
SizeClassedChunk performs 2 atomic ops (retain/release) per allocation cycle on the hot path.
Modification:
Replace ref counting with a segment-count state machine that only needs atomics on the cold deallocation path.
Result:
No more per-allocation atomic operations for SizeClassedChunk.