Thanks to #15525 we've realized that there's still a lot of room for improvement in the adaptive allocator fast path, and we found a lot of issues and unoptimized behaviours.
Let me summarize a few of them in this issue so we can address them (eventually).
The reference benchmark is franz1981@bc334c9, which includes a "fake" adaptive allocator that performs the same measured heavy (atomic) operations as the current adaptive allocator in the thread-local and size-class scenario.
Running this benchmark and profiling it against the adaptive allocator shows a few issues:
too many (uncontended) atomics:
- hot: chunk retain/release (xadd, cas)
- hot: segment mpsc int queue's offer (cas)
- cold: shared mpmc queue's offer/poll
reference count checks for chunks fall off the optimized path of buffers:
(see #15525 (comment))
weird recycling behaviour:
The Recycler used for the thread-local allocation of ByteBuf fails to inline its atomic int updater, see
and its cost is just too high compared to what Mimalloc does (which is unlinking the head of a linked list and relinking it).
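For comparison, here is a minimal sketch of the kind of linked-list head removal and relink mentioned above. It is neither Mimalloc's code nor Netty's Recycler: `LocalFreeList` and `Node` are hypothetical names, and the sketch assumes the list is strictly confined to one thread, which is why it needs no atomics at all.

```java
// Hypothetical sketch: an intrusive, thread-confined free list.
// Assumes only the owner thread ever calls push/pop, so plain fields are enough.
final class LocalFreeList {
    static final class Node {
        Node next;      // intrusive link, reused while the node sits in the list
        final int id;   // stand-in for the pooled object's state

        Node(int id) {
            this.id = id;
        }
    }

    private Node head;  // plain field: no volatile, no CAS

    // Return an object to the pool: link it in front of the current head.
    void push(Node node) {
        node.next = head;
        head = node;
    }

    // Take an object from the pool: unlink the head, or return null if empty.
    Node pop() {
        Node node = head;
        if (node == null) {
            return null;
        }
        head = node.next;
        node.next = null; // do not leak the link to the caller
        return node;
    }
}
```

Both operations are a couple of plain loads and stores, which is the baseline the Recycler path is being compared against here.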
unspecialized logic:
The thread-local (unshared) size-class magazine allocation path is not optimized for its execution context:
- there's no need for a mpmc shared queue: there's no lock, so no other thread can allocate "without locks", and a mpsc queue is enough
- there's no need to aggressively release the chunk from the magazine as soon as it looks like there's not enough capacity: no one else can reuse it, due to the previous point
- there's no need to know the exact size before attempting to "read into" a chunk, because for size-class chunks we just need to know whether there's an available segment (this applies to shared ones as well)
- [TO BE VERIFIED] a chunk is marked for deallocation only if it fails to be placed in the shared queue. This decision can be taken only by the owner thread. After that, the chunk can only observe segment releases and no new allocations, because it is not visible to anyone: this information could be used to simplify the reference counting scheme for chunks, saving the chunk retain/release operations (mentioned at the beginning of this issue)
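To illustrate the first bullet above, here is a minimal sketch of a linked mpsc queue where any thread can offer and only the owner thread polls. It is not the queue Netty uses: `MpscChunkQueue` is a hypothetical name, the design follows the well-known stub-node linked mpsc scheme, and it assumes `poll()` is only ever called by a single consumer thread.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: unbounded linked mpsc queue.
// Producers pay one atomic swap per offer; the single consumer uses no CAS.
final class MpscChunkQueue<E> {
    private static final class Node<E> {
        E value;
        volatile Node<E> next; // volatile write publishes the node to the consumer

        Node(E value) {
            this.value = value;
        }
    }

    // Shared by all producers.
    private final AtomicReference<Node<E>> producerTail;
    // Touched only by the consumer thread, so a plain field is enough.
    private Node<E> consumerHead;

    MpscChunkQueue() {
        Node<E> stub = new Node<>(null); // stub node: the queue starts empty
        consumerHead = stub;
        producerTail = new AtomicReference<>(stub);
    }

    // Safe to call from any thread.
    void offer(E element) {
        Node<E> node = new Node<>(element);
        Node<E> previous = producerTail.getAndSet(node); // one atomic swap
        previous.next = node;                            // link and publish
    }

    // Must be called by the single consumer thread only.
    // May return null while a producer is between the swap and the link.
    E poll() {
        Node<E> next = consumerHead.next;
        if (next == null) {
            return null;
        }
        E value = next.value;
        next.value = null;   // the polled node becomes the new stub
        consumerHead = next;
        return value;
    }
}
```

Compared with a mpmc queue, the consumer side here never competes with other consumers, so it needs no CAS loop. That only holds as long as the owner thread is the sole caller of `poll()`, which is the assumption made in the bullet above.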