perf(allocator): reduce instructions on allocation hot path#21510
Conversation
How to use the Graphite Merge QueueAdd either label to this PR to merge it via the merge queue:
You must have a Graphite account in order to use the merge queue. Sign up using this link. An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue. This stack of pull requests is managed by Graphite. Learn more about stacking. |
Merging this PR will not alter performance
Comparing Footnotes
|
|
For the record, here's Claude's annotated assembly before/after this PR. The paths which are hit repeatedly in Oxc are:
Less path (static size,
|
There was a problem hiding this comment.
Pull request overview
Optimizes oxc_allocator’s Arena allocation fast path by simplifying alignment/space checks into fewer instructions and branches, aiming to improve hot-path performance during allocation-heavy workloads (e.g., parsing/AST building).
Changes:
- Reworks
try_alloc_layout_fastto compute the allocation pointer via a unified alignment/size calculation and a single bounds check based on the sign/top-bit property. - Removes the previous
Ordering-based match and replaces it with anif layout.align() <= MIN_ALIGNsplit. - Updates debug assertions and expands in-code rationale/commentary for the new approach.
e46cbb7 to
4a40a6b
Compare
|
@copilot The implementation is changed considerably since the first version you reviewed. Please review again carefully, starting from scratch. |
Reviewed the implementation from scratch. Here's the full analysis: Algorithm
This unified formula handles all three alignment cases. The Bounds check correctnessThe claim: In-bounds ( Out-of-bounds (
Top-half addresses: The comment (line 206-207) claims correctness without relying on addresses being in the bottom half. I verified this independently. When cursor is strictly above TOP ( Null pointer safety (line 295-297): The argument holds on all real platforms where user-space allocations are in the bottom half. On any platform where an allocation could start exactly at Comment accuracy
|
24db091 to
17997d3
Compare
Merge activity
|
Perf optimization to `Arena`. Trim off instructions from allocation hot path (`try_alloc_layout_fast`). Original implementation taken from `bumpalo` performed quite convoluted arithmetic to determine whether an allocation can be serviced within the current allocator chunk, or whether a new chunk needs to be added to the allocator. Reduce these checks to 2 simple instructions. The basis of the optimization is the invariants that: 1. A chunk's size can never be greater than `isize::MAX` bytes (because no allocation can be). 2. A`Layout`'s size can also never exceed `isize::MAX` bytes. Utilizing these invariants, the bounds checks can be collapsed to a single subtraction and branch on the sign bit flag. More details in comments in the code. When the allocation's requested alignment is larger than the arena's minimum alignment (which it is for all AST nodes, as they're all aligned on 8), this additionally collapses 2 branches into a single check. On x86, the new version uses 1 less register. This is significant as this code gets inlined into call sites, potentially reducing stack spills. | Path | Arch | Old | New | Saved | | --------------------- | ------- | ---------------------------------------------- | ---------------------------------------------- | ------------------------------------------------ | | Less | x86-64 | 7 instructions<br>1 branch<br>2 registers | 5 instructions<br>1 branch<br>1 register | -2 instructions<br> <br>-1 register | | Less | aarch64 | 8 instructions<br>1 branch<br>2 registers | 7 instructions<br>1 branch<br>2 registers | -1 instruction<br> <br> | | Less dynamic | x86-64 | 10 instructions<br>1 branch<br>3 registers | 6 instructions<br>1 branch<br>1 register | -4 instructions<br> <br>-2 registers | | Less dynamic | aarch64 | 10 instructions<br>1 branch<br>3 registers | 8 instructions<br>1 branch<br>2 registers | -2 instructions<br> <br>-1 register | | Equal | x86-64 | 7 instructions<br>1 branch<br>2 registers | 5 instructions<br>1 branch<br>1 register | -2 instructions<br> <br>-1 register | | Equal | aarch64 | 8 instructions<br>1 branch<br>2 registers | 7 instructions<br>1 branch<br>2 registers | -1 instruction<br> <br> | | Equal dynamic | x86-64 | 8 instructions<br>1 branch<br>3 registers | 5 instructions<br>1 branch<br>1 register | -3 instructions<br> <br>-2 registers | | Equal dynamic | aarch64 | 9 instructions<br>1 branch<br>3 registers | 7 instructions<br>1 branch<br>2 registers | -2 instructions<br> <br>-1 register | | Greater | x86-64 | 9 instructions<br>2 branches<br>2 registers | 6 instructions<br>1 branch<br>1 register | -3 instructions<br>-1 branch<br>-1 register | | Greater | aarch64 | 9 instructions<br>2 branches<br>2 registers | 8 instructions<br>1 branch<br>2 registers | -1 instruction<br>-1 branch<br> | | Greater dynamic | x86-64 | 9 instructions<br>2 branches<br>2 registers | 6 instructions<br>1 branch<br>1 register | -3 instructions<br>-1 branch<br>-1 register | | Greater dynamic | aarch64 | 9 instructions<br>1 branch<br>2 registers | 8 instructions<br>1 branch<br>2 registers | -1 instruction<br> <br> | 1% perf improvement on all parser benchmarks.
17997d3 to
3cf7cef
Compare
Fix a test for `Bump` which was failing on 32-bit in CI. #21291 temporarily disabled the test. Turn out, there's no easy way to make this test work on 32-bit platforms, and #21510 added ample unit tests which effectively test the same thing. So keep it disabled on 32-bit, but fix the test to make it work properly on 64-bit. Rationale for the fix is explained in fitzgen/bumpalo#320.
### 💥 BREAKING CHANGES - 502e804 ast: [**BREAKING**] Reduce size of `TSTypePredicateName` (#21711) (overlookmotel) - 5651539 ast: [**BREAKING**] Reduce size of `JSXExpression` (#21710) (overlookmotel) - c44e280 ast: [**BREAKING**] Reduce size of `ArrayExpressionElement` (#21709) (overlookmotel) - c5b3deb syntax: [**BREAKING**] Remove `CommentNodeId` (#21679) (overlookmotel) ### 🚀 Features - b738a39 allocator: Add `Allocator::cursor_ptr` method (#21773) (overlookmotel) - 678767e ast: Generate node_id accessors for AST enum wrappers (#21653) (camc314) - f091d77 minifier: Inline constant spread elements into arrays (#21095) (Armano) ### 🐛 Bug Fixes - 0d608c2 minifier: Preserve raw CR in template literals (#21645) (Dunqing) - a889ea9 minifier: Track pure functions in DCE mode (#21722) (Dunqing) - 674dfac allocator: `Arena` retry allocation when chunk size approaches maximum (#21777) (overlookmotel) - f130cc0 allocator: Fix arithmetic overflow in `Arena::new_chunk_memory_details` (#21745) (overlookmotel) - b9bf239 allocator: Fix UB in `Arena::grow_zeroed` (#21739) (overlookmotel) - d2b9389 allocator: Clippy warning when building without `testing` feature (#21681) (camc314) - 503dc86 codegen: Map sourcemaps from visible output starts (#21662) (Dunqing) - c92bd3b transformer: Use SPAN for synthesized helper calls to prevent comment misattribution (#21578) (Dunqing) - 0d80441 codegen: Add mapping before printing `#` for private ident (#21619) (camc314) ### ⚡ Performance - 9fa362e napi/parser: Do not generate tokens except in tests (#21811) (overlookmotel) - 0044392 allocator: Reduce branches when allocating new chunk (#21776) (overlookmotel) - 7896bd0 allocator: `Allocator::used_bytes` do not use chunk iterator (#21771) (overlookmotel) - a5c562f allocator: Remove check in `Arena::new_chunk_memory_details` (#21750) (overlookmotel) - 35bbe1f allocator: `Arena` use unchecked size round up where guaranteed no overflow (#21743) (overlookmotel) - ffe229b allocator: Remove unnecessary check from `Arena::try_alloc_layout_slow_impl` (#21732) (overlookmotel) - 72fece5 allocator: Use `NonNull::offset_from_unsigned` in `Arena::chunk_capacity` (#21731) (overlookmotel) - cab32ae ast: Add `#[inline(always)]` to `node_id` methods on enums with all variants unboxed (#21707) (overlookmotel) - b179688 parser: Allocate `TriviaBuilder` comments in the arena (#21512) (Boshen) - 2290f31 lexer: Fix perf of `Token::set_*` methods on Rust 1.95.0 (#21659) (overlookmotel) - 1b58029 allocator: Move code into cold path in `Arena::alloc_layout` (#21622) (overlookmotel) - 3cf7cef allocator: Reduce instructions on allocation hot path (#21510) (overlookmotel) ### 📚 Documentation - ce65070 data_structures: Document why `as_ref` and `as_mut` on `NonNullConst` and `NonNullMut` take `self` (#21800) (overlookmotel) - 93b7dbd allocator: Improve doc comments for `ChunkFooter` (#21733) (overlookmotel) - 295db8d transformer: Fix comment (#21717) (overlookmotel) - 5c93af8 ast: Add comments explaining `#[inline(always)]` to `node_id` methods on enums (#21706) (overlookmotel) - e4cea25 transform: Use the `node:` namespace in the example (#19998) (루밀LuMir) ### 🛡️ Security - d8076c9 deps: Update rolldown (#21639) (renovate) Co-authored-by: Boshen <1430279+Boshen@users.noreply.github.com>

Perf optimization to
Arena. Trim off instructions from allocation hot path (try_alloc_layout_fast).Original implementation taken from
bumpaloperformed quite convoluted arithmetic to determine whether an allocation can be serviced within the current allocator chunk, or whether a new chunk needs to be added to the allocator.Reduce these checks to 2 simple instructions.
The basis of the optimization is the invariants that:
isize::MAXbytes (because no allocation can be).Layout's size can also never exceedisize::MAXbytes.Utilizing these invariants, the bounds checks can be collapsed to a single subtraction and branch on the sign bit flag. More details in comments in the code.
When the allocation's requested alignment is larger than the arena's minimum alignment (which it is for all AST nodes, as they're all aligned on 8), this additionally collapses 2 branches into a single check.
On x86, the new version uses 1 less register. This is significant as this code gets inlined into call sites, potentially reducing stack spills.
1 branch
2 registers
1 branch
1 register
-1 register
1 branch
2 registers
1 branch
2 registers
1 branch
3 registers
1 branch
1 register
-2 registers
1 branch
3 registers
1 branch
2 registers
-1 register
1 branch
2 registers
1 branch
1 register
-1 register
1 branch
2 registers
1 branch
2 registers
1 branch
3 registers
1 branch
1 register
-2 registers
1 branch
3 registers
1 branch
2 registers
-1 register
2 branches
2 registers
1 branch
1 register
-1 branch
-1 register
2 branches
2 registers
1 branch
2 registers
-1 branch
2 branches
2 registers
1 branch
1 register
-1 branch
-1 register
1 branch
2 registers
1 branch
2 registers
1% perf improvement on all parser benchmarks.