Improve adaptive unshared size class allocation's fast-path

Thanks to #15525 we've realized that there is still a lot of room to improve the adaptive allocation fast-path, and we found a number of issues and unoptimized behaviours.

Let me summarize a few of them in this issue so we can address them (eventually).

The reference benchmark is https://github.com/franz1981/netty/commit/bc334c970e2911e18cd715cf7c66aeefb3318269 which includes a "fake" adaptive allocator that performs the same heavy (atomic) operations measured in the current adaptive allocator, in the thread-local, size-class scenario.

Running this benchmark and profiling it against the adaptive allocator reveals a few issues:

### **too many (uncontended) atomics**:
- hot: chunk retain/release (xadd, cas)
- hot: segment mpsc int queue's offer (cas)
- cold: shared mpmc queue's offer/poll
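
To make the cost concrete, here is a minimal sketch of what an atomic chunk reference count looks like. `AtomicChunkRefCnt` is a hypothetical stand-in, not Netty's actual class: the point is only that every retain is an atomic read-modify-write (typically a `lock xadd` on x86) and every release is a CAS loop, even when a single thread ever touches the chunk.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Hypothetical stand-in for a chunk's reference count; not Netty's actual class.
final class AtomicChunkRefCnt {
    private static final AtomicIntegerFieldUpdater<AtomicChunkRefCnt> REF_CNT =
            AtomicIntegerFieldUpdater.newUpdater(AtomicChunkRefCnt.class, "refCnt");

    // Starts at 1: the creator holds the first reference.
    private volatile int refCnt = 1;

    // One atomic read-modify-write per allocation, contended or not.
    void retain() {
        REF_CNT.getAndIncrement(this);
    }

    // One CAS (or more under contention) per release.
    // Returns true when the last reference was dropped.
    boolean release() {
        for (;;) {
            int current = refCnt;
            if (current <= 0) {
                throw new IllegalStateException("already released");
            }
            if (REF_CNT.compareAndSet(this, current, current - 1)) {
                return current == 1;
            }
        }
    }

    int refCnt() {
        return refCnt;
    }
}
```

In the thread-local scenario both operations are uncontended, so the atomics buy no extra safety on the hot path while still paying the full cost of the instruction.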

### **reference count checks for chunks fall off the optimized path of buffers**:
 (see https://github.com/netty/netty/pull/15525#issuecomment-3149871407)

### **weird recycling behaviour**:
The `Recycler` used for the thread-local allocation of `ByteBuf` fails to inline its atomic int updater, see:

<img width="2558" height="661" alt="Image" src="https://github.com/user-attachments/assets/53d334fa-8950-4254-a677-4ee0a0794b49" />

and its cost is just too high compared to what mimalloc does (which just removes the top of a linked list and relinks it).
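
For comparison, a thread-confined free list in the mimalloc style can be sketched as below. This is an illustration only: `PooledObject` and `LocalFreeList` are hypothetical names, and the sketch assumes the list is touched exclusively by its owner thread, so plain field reads and writes are enough.

```java
// Hypothetical pooled object carrying an intrusive link to the next free object.
final class PooledObject {
    PooledObject next;
    final int id;

    PooledObject(int id) {
        this.id = id;
    }
}

// Hypothetical owner-thread-only free list: no atomics, no volatile fields.
final class LocalFreeList {
    private PooledObject head;

    // Remove the top of the list and relink the head to its successor.
    PooledObject pop() {
        PooledObject top = head;
        if (top == null) {
            return null; // empty: the caller has to allocate a fresh object
        }
        head = top.next;
        top.next = null;
        return top;
    }

    // Link the recycled object in front of the current head.
    void push(PooledObject obj) {
        obj.next = head;
        head = obj;
    }
}
```

Both operations are a couple of loads and stores, which is the baseline the `Recycler` path is being compared against.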

### **unspecialized logic**
The thread-local (unshared) size-class magazine allocation path is not optimized for its execution context:
1. there's no need for an mpmc shared queue: there is no lock, so no other thread can allocate "without locks", and an mpsc queue is enough
2. there's no need to aggressively release the chunk from the magazine as soon as it looks like there isn't enough capacity: no one else can reuse it, due to the previous point
3. there's no need to know the exact size before attempting to "read into" a chunk, because for size-class chunks we just need to know whether a segment is available (this applies to shared ones as well)
4. [TO BE VERIFIED] a chunk is marked to be deallocated only if it fails to be placed in the shared queue. This decision can be taken only by the owner thread. After that, the chunk can only observe segment releases and no new allocations, because it is not visible to anyone else: this information could be used to simplify the reference-counting scheme for chunks, saving the chunk retain/release operations (mentioned at the beginning of this issue)
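
Point 3 can be illustrated with a small model. This is a sketch under stated assumptions, not the allocator's real data structure: `SizeClassChunkSketch` is a hypothetical name, it is single-threaded, and it uses a boxed `ArrayDeque` where the real code uses a primitive mpsc int queue. Because every segment in a size-class chunk has the same size, allocation never compares a requested size against remaining capacity; it only asks whether a free segment exists.

```java
import java.util.ArrayDeque;

// Hypothetical single-threaded model of a size-class chunk.
final class SizeClassChunkSketch {
    // Start offsets of the segments that are currently free.
    private final ArrayDeque<Integer> freeSegments = new ArrayDeque<>();

    SizeClassChunkSketch(int segmentSize, int segmentCount) {
        for (int i = 0; i < segmentCount; i++) {
            freeSegments.add(i * segmentSize);
        }
    }

    // Returns the start offset of a free segment, or -1 if none is available.
    // No size check: all segments of this chunk have the same size.
    int allocateSegment() {
        Integer offset = freeSegments.poll();
        return offset == null ? -1 : offset;
    }

    // Makes the segment starting at the given offset available again.
    void releaseSegment(int offset) {
        freeSegments.add(offset);
    }
}
```

In this model an exhausted chunk simply returns -1 and becomes usable again once a segment is released, which is also why (per point 2) the owner has no reason to drop it from the magazine eagerly.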



Improve adaptive unshared size class allocation's fast-path #15530

too many (uncontended) atomics:

reference count checks for chunks fall off the optimized path of buffers:

weird recycling behaviour:

unspecialized logic

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Improve adaptive unshared size class allocation's fast-path #15530

Description

too many (uncontended) atomics:

reference count checks for chunks fall off the optimized path of buffers:

weird recycling behaviour:

unspecialized logic

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions