
perf(allocator): reduce instructions on allocation hot path#21510

Merged
graphite-app[bot] merged 1 commit into main from
om/04-16-perf_allocator_reduce_instructions_on_allocation_hot_path
Apr 22, 2026

Conversation

@overlookmotel
Member

@overlookmotel overlookmotel commented Apr 16, 2026

Perf optimization to `Arena`. Trim off instructions from allocation hot path (`try_alloc_layout_fast`).

The original implementation, taken from `bumpalo`, performed quite convoluted arithmetic to determine whether an allocation can be serviced within the current allocator chunk, or whether a new chunk needs to be added to the allocator.

Reduce these checks to 2 simple instructions.

The basis of the optimization is the invariants that:

  1. A chunk's size can never be greater than `isize::MAX` bytes (because no allocation can be).
  2. A `Layout`'s size can also never exceed `isize::MAX` bytes.

Utilizing these invariants, the bounds checks can be collapsed to a single subtraction and branch on the sign bit flag. More details in comments in the code.

When the allocation's requested alignment is larger than the arena's minimum alignment (which it is for all AST nodes, as they're all aligned on 8), this additionally collapses 2 branches into a single check.

On x86, the new version uses 1 fewer register. This is significant, as this code gets inlined into call sites, potentially reducing stack spills.
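The collapsed check can be sketched in plain integer arithmetic (a minimal sketch using `usize` in place of raw pointers — function and variable names here are hypothetical, not the actual oxc implementation):

```rust
const MIN_ALIGN: usize = 8;

// Model of the collapsed fast-path check. `cursor` and `start` stand in
// for the chunk's bump cursor and start address.
fn try_alloc_fast(cursor: usize, start: usize, size: usize, align: usize) -> Option<usize> {
    // Bump downwards, then round down to the larger of `align` and MIN_ALIGN.
    let align = align.max(MIN_ALIGN);
    let new_ptr = cursor.wrapping_sub(size) & !(align - 1);
    // Single bounds check. Because a chunk's size and a `Layout`'s size are
    // both <= isize::MAX, `new_ptr - start` can only exceed isize::MAX by
    // wrapping, i.e. exactly when `new_ptr` falls below the chunk start.
    // This compiles to a subtraction plus a branch on the sign flag.
    if new_ptr.wrapping_sub(start) > isize::MAX as usize {
        None // out of space: caller falls through to the slow path
    } else {
        Some(new_ptr)
    }
}

fn main() {
    // "Equal" path: 32 bytes at align 8 fits in chunk [1000, 1064).
    assert_eq!(try_alloc_fast(1064, 1000, 32, 8), Some(1032));
    // Not enough room: would be serviced by the slow path instead.
    assert_eq!(try_alloc_fast(1064, 1000, 128, 8), None);
    // "Greater" path: align 16 > MIN_ALIGN, still one combined check.
    assert_eq!(try_alloc_fast(1064, 1000, 32, 16), Some(1024));
}
```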

| Path            | Arch    | Old                                         | New                                       | Saved                                       |
| --------------- | ------- | ------------------------------------------- | ----------------------------------------- | ------------------------------------------- |
| Less            | x86-64  | 7 instructions<br>1 branch<br>2 registers   | 5 instructions<br>1 branch<br>1 register  | -2 instructions<br>&nbsp;<br>-1 register    |
| Less            | aarch64 | 8 instructions<br>1 branch<br>2 registers   | 7 instructions<br>1 branch<br>2 registers | -1 instruction                              |
| Less dynamic    | x86-64  | 10 instructions<br>1 branch<br>3 registers  | 6 instructions<br>1 branch<br>1 register  | -4 instructions<br>&nbsp;<br>-2 registers   |
| Less dynamic    | aarch64 | 10 instructions<br>1 branch<br>3 registers  | 8 instructions<br>1 branch<br>2 registers | -2 instructions<br>&nbsp;<br>-1 register    |
| Equal           | x86-64  | 7 instructions<br>1 branch<br>2 registers   | 5 instructions<br>1 branch<br>1 register  | -2 instructions<br>&nbsp;<br>-1 register    |
| Equal           | aarch64 | 8 instructions<br>1 branch<br>2 registers   | 7 instructions<br>1 branch<br>2 registers | -1 instruction                              |
| Equal dynamic   | x86-64  | 8 instructions<br>1 branch<br>3 registers   | 5 instructions<br>1 branch<br>1 register  | -3 instructions<br>&nbsp;<br>-2 registers   |
| Equal dynamic   | aarch64 | 9 instructions<br>1 branch<br>3 registers   | 7 instructions<br>1 branch<br>2 registers | -2 instructions<br>&nbsp;<br>-1 register    |
| Greater         | x86-64  | 9 instructions<br>2 branches<br>2 registers | 6 instructions<br>1 branch<br>1 register  | -3 instructions<br>-1 branch<br>-1 register |
| Greater         | aarch64 | 9 instructions<br>2 branches<br>2 registers | 8 instructions<br>1 branch<br>2 registers | -1 instruction<br>-1 branch                 |
| Greater dynamic | x86-64  | 9 instructions<br>2 branches<br>2 registers | 6 instructions<br>1 branch<br>1 register  | -3 instructions<br>-1 branch<br>-1 register |
| Greater dynamic | aarch64 | 9 instructions<br>1 branch<br>2 registers   | 8 instructions<br>1 branch<br>2 registers | -1 instruction                              |

1% perf improvement on all parser benchmarks.

Member Author

overlookmotel commented Apr 16, 2026


How to use the Graphite Merge Queue

Add either label to this PR to merge it via the merge queue:

  • 0-merge - adds this PR to the back of the merge queue
  • hotfix - for urgent changes, fast-track this PR to the front of the merge queue

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@github-actions github-actions Bot added A-allocator Area - Allocator C-performance Category - Solution not expected to change functional behavior, only performance labels Apr 16, 2026
@codspeed-hq

codspeed-hq Bot commented Apr 16, 2026

Merging this PR will not alter performance

✅ 48 untouched benchmarks
⏩ 3 skipped benchmarks [1]


Comparing om/04-16-perf_allocator_reduce_instructions_on_allocation_hot_path (3cf7cef) with main (ca81199) [2]

Open in CodSpeed

Footnotes

  1. 3 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on main (63eb98b) during the generation of this report, so ca81199 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@overlookmotel overlookmotel self-assigned this Apr 16, 2026
@overlookmotel overlookmotel marked this pull request as ready for review April 16, 2026 20:55
Copilot AI review requested due to automatic review settings April 16, 2026 20:55
@overlookmotel
Member Author

overlookmotel commented Apr 16, 2026

For the record, here's Claude's annotated assembly before/after this PR.

The paths which are hit repeatedly in Oxc are:

  • "Greater (static)" - AST node allocation
  • "Equal (dynamic)" - Arena::alloc_str

Less path (static size, align < MIN_ALIGN)

Arena<8>::try_alloc_layout(Layout(29, 1))

Old (x86-64)

mov  rax, [rdi]              ; load cursor_ptr
mov  rcx, rax                ; copy cursor (need original later for new_ptr)
sub  rcx, [rdi + 16]         ; capacity = cursor - start_ptr
cmp  rcx, 32                 ; capacity >= 32? (29 rounded up to MIN_ALIGN=8 = 32)
jb   .slow                   ; if not, branch to slow path
add  rax, -32                ; new_ptr = cursor - 32
mov  [rdi], rax              ; store new cursor
ret

New (x86-64)

mov  rax, [rdi]              ; load cursor_ptr
add  rax, -32                ; new_ptr = cursor - 32
cmp  rax, [rdi + 16]         ; new_ptr vs start_ptr
js   .slow                   ; if signed, tail-call slow path
mov  [rdi], rax              ; store new cursor
ret

Old (aarch64)

ldr  x8, [x0]                ; load cursor_ptr
ldr  x9, [x0, #16]           ; load start_ptr
sub  x9, x8, x9              ; capacity = cursor - start_ptr
cmp  x9, #32                 ; capacity >= 32?
b.lo .slow                   ; if not, branch to slow path
sub  x8, x8, #32             ; new_ptr = cursor - 32
str  x8, [x0]                ; store new cursor
mov  x0, x8                  ; move result to return register
ret

New (aarch64)

ldr  x8, [x0]                ; load cursor_ptr
ldr  x9, [x0, #16]           ; load start_ptr
sub  x8, x8, #32             ; new_ptr = cursor - 32 (AND elided)
cmp  x8, x9                  ; new_ptr vs start_ptr
b.mi .slow                   ; branch on negative flag
str  x8, [x0]                ; store new cursor
mov  x0, x8                  ; return register
ret

Less path dynamic (dynamic size, align < MIN_ALIGN)

Arena<8>::try_alloc_layout(Layout(size, 1))

Old (x86-64)

mov  rdx, rsi                ; save size for slow-path call
mov  rax, [rdi]              ; load cursor_ptr
lea  rcx, [rsi + 7]          ; size + 7 (round-up step 1)
and  rcx, -8                 ; aligned_size = (size + 7) & ~7 (round-up step 2)
mov  rsi, rax                ; copy cursor
sub  rsi, [rdi + 16]         ; capacity = cursor - start_ptr
cmp  rcx, rsi                ; aligned_size > capacity?
ja   .slow                   ; if so, branch to slow path
sub  rax, rcx                ; new_ptr = cursor - aligned_size
mov  [rdi], rax              ; store new cursor
ret

New (x86-64)

mov  rax, [rdi]              ; load cursor_ptr
sub  rax, rsi                ; cursor - size
and  rax, -8                 ; round down to MIN_ALIGN = new_ptr
cmp  rax, [rdi + 16]         ; new_ptr vs start_ptr
js   .slow                   ; tail-call slow path on out of bounds
mov  [rdi], rax              ; store new cursor
ret

Old (aarch64)

ldr  x8, [x0]                ; load cursor_ptr
ldr  x10, [x0, #16]          ; load start_ptr
add  x9, x1, #7              ; size + 7
and  x9, x9, #~7             ; aligned_size = (size + 7) & ~7
sub  x10, x8, x10            ; capacity = cursor - start_ptr
cmp  x9, x10                 ; aligned_size > capacity?
b.hi .slow                   ; if so, branch to slow path
sub  x8, x8, x9              ; new_ptr = cursor - aligned_size
str  x8, [x0]                ; store new cursor
mov  x0, x8                  ; return register
ret

New (aarch64)

ldr  x8, [x0]                ; load cursor_ptr
ldr  x9, [x0, #16]           ; load start_ptr
sub  x8, x8, x1              ; cursor - size
and  x8, x8, #~7             ; round down to MIN_ALIGN
cmp  x8, x9                  ; new_ptr vs start_ptr
b.mi .slow                   ; branch on negative
str  x8, [x0]                ; store new cursor
mov  x0, x8                  ; return register
ret

Equal path (static size, align == MIN_ALIGN)

Arena<8>::try_alloc_layout(Layout(32, 8))

Old (x86-64)

mov  rax, [rdi]              ; load cursor_ptr
mov  rcx, rax                ; copy cursor
sub  rcx, [rdi + 16]         ; capacity = cursor - start_ptr
cmp  rcx, 32                 ; capacity >= 32?
jb   .slow                   ; if not, branch to slow path
add  rax, -32                ; new_ptr = cursor - 32
mov  [rdi], rax              ; store new cursor
ret

New (x86-64)

mov  rax, [rdi]              ; load cursor_ptr
add  rax, -32                ; new_ptr = cursor - 32 (AND elided)
cmp  rax, [rdi + 16]         ; new_ptr vs start_ptr
js   .slow                   ; tail-call slow path
mov  [rdi], rax              ; store new cursor
ret

Old (aarch64)

ldr  x8, [x0]                ; load cursor_ptr
ldr  x9, [x0, #16]           ; load start_ptr
sub  x9, x8, x9              ; capacity = cursor - start_ptr
cmp  x9, #32                 ; capacity >= 32?
b.lo .slow                   ; if not, branch to slow path
sub  x8, x8, #32             ; new_ptr = cursor - 32
str  x8, [x0]                ; store new cursor
mov  x0, x8                  ; return register
ret

New (aarch64)

ldr  x8, [x0]                ; load cursor_ptr
ldr  x9, [x0, #16]           ; load start_ptr
sub  x8, x8, #32             ; new_ptr = cursor - 32 (AND elided)
cmp  x8, x9                  ; new_ptr vs start_ptr
b.mi .slow                   ; branch on negative
str  x8, [x0]                ; store new cursor
mov  x0, x8                  ; return register
ret

Equal path dynamic (dynamic size known multiple of 8, align == MIN_ALIGN)

Arena<8>::try_alloc_layout(Layout(size, 8))

Old (x86-64)

mov  rdx, rsi                ; save size for slow path
mov  rax, [rdi]              ; load cursor_ptr
mov  rcx, rax                ; copy cursor
sub  rcx, [rdi + 16]         ; capacity = cursor - start_ptr
cmp  rsi, rcx                ; size > capacity?
ja   .slow                   ; if so, branch to slow path
sub  rax, rdx                ; new_ptr = cursor - size
mov  [rdi], rax              ; store new cursor
ret

New (x86-64)

mov  rax, [rdi]              ; load cursor_ptr
sub  rax, rsi                ; new_ptr = cursor - size (AND elided via is_multiple_of(8) hint)
cmp  rax, [rdi + 16]         ; new_ptr vs start_ptr
js   .slow                   ; tail-call slow path
mov  [rdi], rax              ; store new cursor
ret

Old (aarch64)

mov  x2, x1                  ; save size for slow path
ldr  x8, [x0]                ; load cursor_ptr
ldr  x9, [x0, #16]           ; load start_ptr
sub  x9, x8, x9              ; capacity = cursor - start_ptr
cmp  x1, x9                  ; size > capacity?
b.hi .slow                   ; if so, branch to slow path
sub  x8, x8, x2              ; new_ptr = cursor - size
str  x8, [x0]                ; store new cursor
mov  x0, x8                  ; return register
ret

New (aarch64)

ldr  x8, [x0]                ; load cursor_ptr
ldr  x9, [x0, #16]           ; load start_ptr
sub  x8, x8, x1              ; new_ptr = cursor - size (AND elided)
cmp  x8, x9                  ; new_ptr vs start_ptr
b.mi .slow                   ; branch on negative
str  x8, [x0]                ; store new cursor
mov  x0, x8                  ; return register
ret

Greater path (static size, align > MIN_ALIGN)

Arena<1>::try_alloc_layout(Layout(32, 8))

Old (x86-64)

mov  rax, [rdi]              ; load cursor_ptr
and  rax, -8                 ; aligned_ptr = cursor & ~7 (round down to align 8)
mov  rcx, rax                ; copy aligned_ptr
sub  rcx, [rdi + 16]         ; capacity = aligned_ptr - start_ptr (sets carry on underflow)
jb   .slow                   ; branch 1: aligned_ptr < start_ptr (underflow)
cmp  rcx, 32                 ; capacity >= 32?
jb   .slow                   ; branch 2: not enough room
add  rax, -32                ; new_ptr = aligned_ptr - 32
mov  [rdi], rax              ; store new cursor
ret

New (x86-64)

mov  rax, [rdi]              ; load cursor_ptr
add  rax, -32                ; cursor - 32
and  rax, -8                 ; round down to align 8 = new_ptr
cmp  rax, [rdi + 16]         ; new_ptr vs start_ptr
js   .slow                   ; tail-call slow path
mov  [rdi], rax              ; store new cursor
ret

Old (aarch64)

ldr  x8, [x0]                ; load cursor_ptr
ldr  x9, [x0, #16]           ; load start_ptr
and  x8, x8, #~7             ; aligned_ptr = cursor & ~7
subs x9, x8, x9              ; capacity = aligned_ptr - start (sets flags)
b.lo .slow                   ; branch 1: aligned_ptr < start (underflow)
cmp  x9, #32                 ; capacity >= 32?
b.lo .slow                   ; branch 2: not enough room
sub  x8, x8, #32             ; new_ptr = aligned_ptr - 32
str  x8, [x0]                ; store new cursor
mov  x0, x8                  ; return register
ret

New (aarch64)

ldr  x8, [x0]                ; load cursor_ptr
ldr  x9, [x0, #16]           ; load start_ptr
sub  x8, x8, #32             ; cursor - 32
and  x8, x8, #~7             ; round down to align 8 = new_ptr
cmp  x8, x9                  ; new_ptr vs start_ptr
b.mi .slow                   ; branch on negative
str  x8, [x0]                ; store new cursor
mov  x0, x8                  ; return register
ret

Greater path dynamic (dynamic size, align > MIN_ALIGN)

Arena<1>::try_alloc_layout(Layout(size, 8))

Old (x86-64)

mov  rax, [rdi]              ; load cursor_ptr
and  rax, -8                 ; aligned_ptr = cursor & ~7
mov  rcx, rax                ; copy aligned_ptr
sub  rcx, [rdi + 16]         ; capacity = aligned_ptr - start_ptr (sets carry on underflow)
jb   .slow                   ; branch 1: aligned_ptr < start (underflow) — tail call
cmp  rsi, rcx                ; size > capacity?
ja   .slow                   ; branch 2: not enough room — tail call
sub  rax, rsi                ; new_ptr = aligned_ptr - size
mov  [rdi], rax              ; store new cursor
ret

New (x86-64)

mov  rax, [rdi]              ; load cursor_ptr
sub  rax, rsi                ; cursor - size
and  rax, -8                 ; round down to align 8 = new_ptr
cmp  rax, [rdi + 16]         ; new_ptr vs start_ptr
js   .slow                   ; tail-call slow path
mov  [rdi], rax              ; store new cursor
ret

Old (aarch64)

ldr  x8, [x0]                ; load cursor_ptr
ldr  x9, [x0, #16]           ; load start_ptr
and  x8, x8, #~7             ; aligned_ptr = cursor & ~7
subs x9, x8, x9              ; capacity = aligned_ptr - start (sets flags)
ccmp x1, x9, #2, hs          ; if no underflow: compare size with capacity, else set "out of bounds"
b.hi .slow                   ; branch on either failure
sub  x8, x8, x1              ; new_ptr = aligned_ptr - size
str  x8, [x0]                ; store new cursor
mov  x0, x8                  ; return register
ret

New (aarch64)

ldr  x8, [x0]                ; load cursor_ptr
ldr  x9, [x0, #16]           ; load start_ptr
sub  x8, x8, x1              ; cursor - size
and  x8, x8, #~7             ; round down to align 8 = new_ptr
cmp  x8, x9                  ; new_ptr vs start_ptr
b.mi .slow                   ; branch on negative
str  x8, [x0]                ; store new cursor
mov  x0, x8                  ; return register
ret

Contributor

Copilot AI left a comment


Pull request overview

Optimizes oxc_allocator’s Arena allocation fast path by simplifying alignment/space checks into fewer instructions and branches, aiming to improve hot-path performance during allocation-heavy workloads (e.g., parsing/AST building).

Changes:

  • Reworks try_alloc_layout_fast to compute the allocation pointer via a unified alignment/size calculation and a single bounds check based on the sign/top-bit property.
  • Removes the previous Ordering-based match and replaces it with an if layout.align() <= MIN_ALIGN split.
  • Updates debug assertions and expands in-code rationale/commentary for the new approach.

Comment thread crates/oxc_allocator/src/arena/alloc_impl.rs Outdated
Comment thread crates/oxc_allocator/src/arena/alloc_impl.rs Outdated
@overlookmotel overlookmotel force-pushed the om/04-16-perf_allocator_reduce_instructions_on_allocation_hot_path branch 2 times, most recently from e46cbb7 to 4a40a6b on April 21, 2026 22:53
@overlookmotel overlookmotel requested a review from Copilot April 21, 2026 23:08
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread crates/oxc_allocator/src/arena/alloc_impl.rs
@overlookmotel
Member Author

@copilot The implementation is changed considerably since the first version you reviewed. Please review again carefully, starting from scratch.

Contributor

Copilot AI commented Apr 21, 2026

> @copilot The implementation is changed considerably since the first version you reviewed. Please review again carefully, starting from scratch.

Reviewed the implementation from scratch. Here's the full analysis:


Algorithm

new_ptr = round_down(cursor - size, max(align, MIN_ALIGN))

This unified formula handles all three alignment cases. The round_mut_ptr_down_to implementation using wrapping_sub (rather than AND-masking) preserves pointer provenance and critically enables LLVM's known-bits analysis to fold (cursor - constant_size) - constant_low_bits into a single cursor - constant for statically-known sizes. The assert_unchecked! on cursor alignment is the essential hint that makes this fold possible.
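The subtraction-based round-down can be sketched as follows (illustrative sketch on `usize` values; the real `round_mut_ptr_down_to` operates on raw pointers, where this form matters for provenance):

```rust
// Round an address down to a power-of-two alignment by AND-masking.
fn round_down_mask(addr: usize, align: usize) -> usize {
    addr & !(align - 1)
}

// Equivalent, expressed as subtracting the low bits. On real pointers this
// keeps the result as pointer arithmetic (preserving provenance), and lets
// LLVM's known-bits analysis fold a constant-size subtraction and the
// rounding step into a single instruction.
fn round_down_sub(addr: usize, align: usize) -> usize {
    addr.wrapping_sub(addr & (align - 1))
}

fn main() {
    // The two forms agree on all inputs, including the wrap-around edge.
    for addr in [0usize, 1, 7, 8, 29, 1000, usize::MAX] {
        assert_eq!(round_down_sub(addr, 8), round_down_mask(addr, 8));
    }
}
```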


Bounds check correctness

The claim: new_ptr.wrapping_sub(start) > isize::MAX iff new_ptr < start.

In-bounds (new_ptr >= start): new_ptr - start doesn't wrap, and is <= cursor - start <= isize::MAX. ✓

Out-of-bounds (new_ptr < start): The proof in the comments is sound.

  • Max total subtraction is isize::MAX + 1 (the less/equal path reaching round_up(isize::MAX, MIN_ALIGN) when MIN_ALIGN is a power of 2).
  • The minimum out-of-bounds wrapping result is isize::MAX + 1 (the extremal case of empty chunk + max total subtraction).
  • All other out-of-bounds cases give larger wrapping results, correctly detected by > isize::MAX.
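The extremal case above can be checked numerically (a sketch with an arbitrary start address; `wrapped_diff` is a hypothetical helper modelling `new_ptr.wrapping_sub(start)` for an empty chunk):

```rust
// Models the bounds-check quantity for an empty chunk (cursor == start)
// after a total subtraction of `total_sub` bytes.
fn wrapped_diff(start: usize, total_sub: usize) -> usize {
    let new_ptr = start.wrapping_sub(total_sub); // cursor == start
    new_ptr.wrapping_sub(start)
}

fn main() {
    // Maximum possible total subtraction: isize::MAX + 1.
    let max_sub = (isize::MAX as usize) + 1;
    let diff = wrapped_diff(0x1000, max_sub);
    // The wrapping difference lands exactly at isize::MAX + 1 ...
    assert_eq!(diff, max_sub);
    // ... which the `> isize::MAX` check still classifies as out of bounds.
    assert!(diff > isize::MAX as usize);
}
```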

Top-half addresses: The comment (line 206-207) claims correctness without relying on addresses being in the bottom half. I verified this independently. When cursor is strictly above TOP (2^63), the Layout invariant (round_up(size, align) <= isize::MAX) implies that for any power-of-2 alignment, isize::MAX ≡ align-1 (mod align), so the max valid size is isize::MAX - (align-1). This means cursor - max_valid_size >= align, so new_ptr >= align after rounding. Since start - new_ptr <= (TOP + delta) - (delta + 1) = isize::MAX, the wrapping difference is >= 2^63 + 1 > isize::MAX. The check is always correct. ✓

Null pointer safety (line 295-297): The argument holds on all real platforms where user-space allocations are in the bottom half. On any platform where an allocation could start exactly at TOP, 0.wrapping_sub(TOP) = isize::MAX + 1 > isize::MAX still catches it. For starts strictly above TOP, the analysis above shows new_ptr can't be null with valid Layout sizes. ✓


Comment accuracy

  • Line 125: "maximum subtraction before rounding is isize::MAX - align + 1" — correct: for align > 1, since round_up(size, align) <= isize::MAX and isize::MAX ≡ align-1 (mod align), the max valid size is exactly isize::MAX - (align-1).
  • Line 191: "So new_ptr.wrapping_sub(start_ptr) > isize::MAX + 1" — correct for "all other" out-of-bounds cases (i.e., not the minimum extremal case). Together with the minimum case at isize::MAX + 1, all out-of-bounds results are >= isize::MAX + 1.
  • Lines 184-187: The arithmetic for the extremal case (0.wrapping_sub(isize::MAX + 1) = isize::MAX + 1) is correct.

unsafe { assert_unchecked! } soundness

Sound: Arena maintains cursor alignment to MIN_ALIGN at all mutation points — initial creation (footer is CHUNK_ALIGN-aligned and CHUNK_ALIGN >= MIN_ALIGN), try_alloc_layout_fast (result of round_mut_ptr_down_to with divisor >= MIN_ALIGN), alloc_layout_slow (sets cursor to new footer), and dealloc (rounds up to MIN_ALIGN). ✓


Test coverage

The test suite is comprehensive. It covers: bottom-half, top-half, spanning-midpoint, near-usize::MAX, all three alignment paths (less/equal/greater), ZSTs, consecutive allocations, off-by-ones, max valid Layout sizes for each path, boundary total-subtraction exactly at isize::MAX + 1, and various MIN_ALIGN values (1, 8, 16). All 60 tests pass.


No issues found. The implementation is correct, the invariant proof is sound, and the test suite provides thorough coverage of the edge cases.

@overlookmotel overlookmotel force-pushed the om/04-16-perf_allocator_reduce_instructions_on_allocation_hot_path branch from 24db091 to 17997d3 on April 21, 2026 23:38
@graphite-app
Contributor

graphite-app Bot commented Apr 21, 2026

Merge activity

@graphite-app graphite-app Bot force-pushed the om/04-16-perf_allocator_reduce_instructions_on_allocation_hot_path branch from 17997d3 to 3cf7cef on April 21, 2026 23:54
@graphite-app graphite-app Bot merged commit 3cf7cef into main Apr 22, 2026
37 checks passed
@graphite-app graphite-app Bot deleted the om/04-16-perf_allocator_reduce_instructions_on_allocation_hot_path branch April 22, 2026 00:03
graphite-app Bot pushed a commit that referenced this pull request Apr 24, 2026
Fix a test for `Bump` which was failing on 32-bit in CI.

#21291 temporarily disabled the test. Turns out, there's no easy way to make this test work on 32-bit platforms, and #21510 added ample unit tests which effectively test the same thing.

So keep it disabled on 32-bit, but fix the test to make it work properly on 64-bit.

Rationale for the fix is explained in fitzgen/bumpalo#320.
camc314 pushed a commit that referenced this pull request Apr 27, 2026
### 💥 BREAKING CHANGES

- 502e804 ast: [**BREAKING**] Reduce size of `TSTypePredicateName`
(#21711) (overlookmotel)
- 5651539 ast: [**BREAKING**] Reduce size of `JSXExpression` (#21710)
(overlookmotel)
- c44e280 ast: [**BREAKING**] Reduce size of `ArrayExpressionElement`
(#21709) (overlookmotel)
- c5b3deb syntax: [**BREAKING**] Remove `CommentNodeId` (#21679)
(overlookmotel)

### 🚀 Features

- b738a39 allocator: Add `Allocator::cursor_ptr` method (#21773)
(overlookmotel)
- 678767e ast: Generate node_id accessors for AST enum wrappers (#21653)
(camc314)
- f091d77 minifier: Inline constant spread elements into arrays (#21095)
(Armano)

### 🐛 Bug Fixes

- 0d608c2 minifier: Preserve raw CR in template literals (#21645)
(Dunqing)
- a889ea9 minifier: Track pure functions in DCE mode (#21722) (Dunqing)
- 674dfac allocator: `Arena` retry allocation when chunk size approaches
maximum (#21777) (overlookmotel)
- f130cc0 allocator: Fix arithmetic overflow in
`Arena::new_chunk_memory_details` (#21745) (overlookmotel)
- b9bf239 allocator: Fix UB in `Arena::grow_zeroed` (#21739)
(overlookmotel)
- d2b9389 allocator: Clippy warning when building without `testing`
feature (#21681) (camc314)
- 503dc86 codegen: Map sourcemaps from visible output starts (#21662)
(Dunqing)
- c92bd3b transformer: Use SPAN for synthesized helper calls to prevent
comment misattribution (#21578) (Dunqing)
- 0d80441 codegen: Add mapping before printing `#` for private ident
(#21619) (camc314)

### ⚡ Performance

- 9fa362e napi/parser: Do not generate tokens except in tests (#21811)
(overlookmotel)
- 0044392 allocator: Reduce branches when allocating new chunk (#21776)
(overlookmotel)
- 7896bd0 allocator: `Allocator::used_bytes` do not use chunk iterator
(#21771) (overlookmotel)
- a5c562f allocator: Remove check in `Arena::new_chunk_memory_details`
(#21750) (overlookmotel)
- 35bbe1f allocator: `Arena` use unchecked size round up where
guaranteed no overflow (#21743) (overlookmotel)
- ffe229b allocator: Remove unnecessary check from
`Arena::try_alloc_layout_slow_impl` (#21732) (overlookmotel)
- 72fece5 allocator: Use `NonNull::offset_from_unsigned` in
`Arena::chunk_capacity` (#21731) (overlookmotel)
- cab32ae ast: Add `#[inline(always)]` to `node_id` methods on enums
with all variants unboxed (#21707) (overlookmotel)
- b179688 parser: Allocate `TriviaBuilder` comments in the arena
(#21512) (Boshen)
- 2290f31 lexer: Fix perf of `Token::set_*` methods on Rust 1.95.0
(#21659) (overlookmotel)
- 1b58029 allocator: Move code into cold path in `Arena::alloc_layout`
(#21622) (overlookmotel)
- 3cf7cef allocator: Reduce instructions on allocation hot path (#21510)
(overlookmotel)

### 📚 Documentation

- ce65070 data_structures: Document why `as_ref` and `as_mut` on
`NonNullConst` and `NonNullMut` take `self` (#21800) (overlookmotel)
- 93b7dbd allocator: Improve doc comments for `ChunkFooter` (#21733)
(overlookmotel)
- 295db8d transformer: Fix comment (#21717) (overlookmotel)
- 5c93af8 ast: Add comments explaining `#[inline(always)]` to `node_id`
methods on enums (#21706) (overlookmotel)
- e4cea25 transform: Use the `node:` namespace in the example (#19998)
(루밀LuMir)

### 🛡️ Security

- d8076c9 deps: Update rolldown (#21639) (renovate)

Co-authored-by: Boshen <1430279+Boshen@users.noreply.github.com>