perf(allocator): store pointers directly in `Arena` (#21483)

graphite-app[bot] merged 1 commit into `main`.
Pull request overview

This PR optimizes `oxc_allocator::arena::Arena`'s fast-path allocations by storing the current chunk's bump cursor and start pointer directly on `Arena`, avoiding repeated indirection through the `ChunkFooter` (and repeated loads through a `Cell`).

Changes:
- Added `Arena::cursor_ptr` and `Arena::start_ptr` fields (both `Cell<NonNull<u8>>`) and initialized/maintained them across arena lifecycle operations.
- Repurposed `ChunkFooter::cursor_ptr` to store the final cursor only for retired chunks (to support chunk iteration APIs).
- Updated chunk iteration to source the current chunk's cursor from `Arena` while continuing to read retired chunk cursors from each footer.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| crates/oxc_allocator/src/arena/mod.rs | Adds cursor_ptr/start_ptr to Arena, adjusts ChunkFooter fields, removes ChunkFooter::as_raw_parts. |
| crates/oxc_allocator/src/arena/from_raw_parts.rs | Updates raw-transfer construction and cursor mutation to use Arena::cursor_ptr. |
| crates/oxc_allocator/src/arena/drop.rs | Updates reset() to restore the arena-level cursor pointer. |
| crates/oxc_allocator/src/arena/create.rs | Initializes arena-level start_ptr/cursor_ptr in new_impl, updates footer construction. |
| crates/oxc_allocator/src/arena/chunks.rs | Uses arena-level cursor for current chunk iteration; stores it in the iterator for the first step. |
| crates/oxc_allocator/src/arena/alloc_impl.rs | Moves fast-path allocation reads/writes to arena-level pointers; syncs retiring chunk cursor back into its footer. |
`Arena` stores pointers to the current bump cursor and the start of the chunk in `ChunkFooter`. This means that every allocation involves pointer chasing: read the pointer to the `ChunkFooter`, then read the `ChunkFooter`'s `cursor_ptr` and `start_ptr` fields. Because the pointer to the `ChunkFooter` is wrapped in a `Cell`, the compiler likely cannot assume the value is still what it was the last time it read the field, and will read it over and over again. This adds ~4 cycles of latency to every allocation.

Instead, store these pointers as fields of the `Arena` itself, to avoid this indirection. `start_ptr` still also needs to be stored in `ChunkFooter` for use when deallocating chunks, and `cursor_ptr` is also stored in `ChunkFooter` to support the `iter_allocated_chunks` and `iter_allocated_chunks_raw` methods.

0.3% - 0.5% perf improvement in parser benchmarks. Allocation is already so fast that the impact is small - allocation is not the bottleneck. But a micro-benchmark testing allocation in isolation shows that allocation itself gets a 2x speed-up.

See the comment on `Arena` for details of a field layout oddity which has a huge impact on aarch64 (Apple Silicon).
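To illustrate the idea, here is a minimal, hypothetical sketch (not oxc's actual code; `MiniArena` and `alloc_bytes` are invented names): the bump cursor and chunk start live directly on the arena as `Cell<NonNull<u8>>` fields, so the allocation fast path is two loads from the arena itself, with no hop through a chunk footer.

```rust
use std::cell::Cell;
use std::ptr::NonNull;

/// Hypothetical single-chunk bump arena, bumping downwards (like bumpalo/oxc).
struct MiniArena {
    /// Next allocation ends here; stored on the arena, not in a footer.
    cursor_ptr: Cell<NonNull<u8>>,
    /// Start of the chunk; allocation fails if the cursor would pass it.
    start_ptr: Cell<NonNull<u8>>,
    /// Backing storage, kept alive for the arena's lifetime.
    _chunk: Box<[u8]>,
}

impl MiniArena {
    fn with_capacity(capacity: usize) -> Self {
        let chunk = vec![0u8; capacity].into_boxed_slice();
        let start = NonNull::new(chunk.as_ptr() as *mut u8).unwrap();
        // Cursor starts at the end of the chunk and bumps downwards.
        let end = unsafe { NonNull::new_unchecked(start.as_ptr().add(capacity)) };
        MiniArena { cursor_ptr: Cell::new(end), start_ptr: Cell::new(start), _chunk: chunk }
    }

    /// Fast path: read `cursor_ptr` and `start_ptr` straight off the arena.
    /// `align` must be a power of two.
    fn alloc_bytes(&self, size: usize, align: usize) -> Option<NonNull<u8>> {
        debug_assert!(align.is_power_of_two());
        let cursor = self.cursor_ptr.get().as_ptr() as usize;
        let start = self.start_ptr.get().as_ptr() as usize;
        // Bump downwards, then round down to the required alignment.
        let new_cursor = cursor.checked_sub(size)? & !(align - 1);
        if new_cursor < start {
            return None; // chunk exhausted; the real arena grows a new chunk here
        }
        let ptr = NonNull::new(new_cursor as *mut u8)?;
        self.cursor_ptr.set(ptr);
        Some(ptr)
    }
}
```

In the pre-PR layout, `cursor_ptr` and `start_ptr` would instead sit behind a `Cell<NonNull<ChunkFooter>>`, adding a dependent load before either field can be read.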
### 🐛 Bug Fixes

- 48967e8 isolated_declarations: Drop required type check for private parameter properties on private constructors (#21515) (Dunqing)
- 91e5bde transformer/typescript: Preserve computed-key static block when class has an empty constructor (#21562) (Dunqing)
- 50e9d26 mangler: Assign correct slot to shadowed function-expression names (#21535) (Dunqing)
- 065ce47 isolated_declarations: Collect types from private accessors for paired inference (#21516) (Dunqing)
- 00fc136 codegen: Preserve coverage comments before object properties (#21312) (bab)
- d676e0c minifier: Mark LHS of `??=` as read when converting from `== null &&` (#21546) (Gunnlaugur Thor Briem)

### ⚡ Performance

- e45efc5 parser: Reduce `try_parse` usage in favour of `lookahead` (#21532) (Boshen)
- ddb1bf8 parser: Avoid redundant `IdentifierReference` clone in shorthand property (#21511) (Boshen)
- be2b392 allocator: Store pointers directly in `Arena` (#21483) (overlookmotel)

Co-authored-by: Boshen <1430279+Boshen@users.noreply.github.com>
Co-authored-by: Cameron <cameron.clark@hey.com>
The chunks in an `Arena` form a linked list. Previously the list was terminated with a canonical empty chunk, defined as a `static`. Now that `start_ptr` and `cursor_ptr` are stored in `Arena` itself (#21483), an empty `Arena` doesn't need a pointer to a chunk. Remove the empty chunk, and use `None` to signify the end of the linked list instead. This makes checking "is this chunk the last?" a little cheaper (a comparison with 0, rather than with a 64-bit static address), and is less hacky and more explicit. It also removes the potential hazard of accidentally mutating the immutable `static` empty chunk.
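A minimal sketch of the `None`-terminated list (hypothetical types, not oxc's actual definitions; `count_chunks` is an invented helper): `Option<NonNull<ChunkFooter>>` is niche-optimized to the same size as a raw pointer, so the "is this the last chunk?" check compiles to a comparison against null (0), not against the address of a `static`.

```rust
use std::ptr::NonNull;

/// Hypothetical chunk footer: each chunk points to the previous one;
/// `None` (a null pointer, thanks to the `NonNull` niche) ends the list.
struct ChunkFooter {
    prev: Option<NonNull<ChunkFooter>>,
}

/// Walk the list from the newest chunk until `None` terminates it.
fn count_chunks(mut head: Option<NonNull<ChunkFooter>>) -> usize {
    let mut n = 0;
    while let Some(footer) = head {
        n += 1;
        // SAFETY: in this sketch, footers are leaked and live forever.
        head = unsafe { footer.as_ref().prev };
    }
    n
}
```

An empty arena is simply `head: None`; no shared `static` sentinel chunk exists to be accidentally mutated.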
