perf(lexer): utilize jump table for distinguishing tokens (~5% improvement) #16
Closed
Boshen wants to merge 1 commit into
fece96f to
0427bdd
Compare
Boshen
commented
Feb 12, 2023
Member
Author
The Ratel project goes to the extreme of putting callbacks inside the jump table. I won't do that here because the current approach is already fast enough; I may come back and revisit this later. https://github.com/ratel-rust/ratel-core/blob/master/ratel/src/lexer/mod.rs
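For readers unfamiliar with the technique, here is a minimal sketch of a byte-indexed lookup table for classifying the first byte of a token. It is not the code from this PR: `Kind`, `build_table`, `BYTE_KINDS`, and `classify` are hypothetical names, and the table stores token kinds rather than callbacks (the Ratel-style variant would store function pointers in the same 256-entry array).

```rust
/// Hypothetical, heavily reduced set of token categories.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Ident,
    Number,
    Whitespace,
    Punct,
    Unknown,
}

/// Build the 256-entry table at compile time, one entry per possible byte.
const fn build_table() -> [Kind; 256] {
    let mut table = [Kind::Unknown; 256];
    let mut i = 0;
    while i < 256 {
        table[i] = match i as u8 {
            b'a'..=b'z' | b'A'..=b'Z' | b'_' | b'$' => Kind::Ident,
            b'0'..=b'9' => Kind::Number,
            b' ' | b'\t' | b'\n' | b'\r' => Kind::Whitespace,
            b'(' | b')' | b'{' | b'}' | b';' | b',' => Kind::Punct,
            _ => Kind::Unknown,
        };
        i += 1;
    }
    table
}

static BYTE_KINDS: [Kind; 256] = build_table();

/// One indexed load replaces a chain of comparisons on the byte.
fn classify(byte: u8) -> Kind {
    BYTE_KINDS[byte as usize]
}
```

Whether this beats a plain `match` depends on what the compiler already generates for the `match`, so it is worth benchmarking both.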
Contributor
Parser Benchmark Results |
Member
Author
😕 |
728326a to
42ea38e
Compare
42ea38e to
5fa83fd
Compare
Member
Author
Attempt failed; I guess the compiler is smarter than me.
Boshen
added a commit
that referenced
this pull request
Oct 7, 2025
Replace multi-function calls and multiple enum variant checks with simple range checks, reducing assembly instructions in hot paths.

## Changes

**`is_any_keyword()`**:
- Before: Called 4 separate functions (`is_reserved_keyword()`, `is_contextual_keyword()`, `is_strict_mode_contextual_keyword()`, `is_future_reserved_keyword()`) checking 70+ enum variants with early returns
- After: Single range check `Await..=Yield` since all keywords are contiguous in the enum

**`is_number()`**:
- Before: Matched 11 separate enum variants
- After: Single range check `Decimal..=HexBigInt` since all numeric literals are contiguous

## Assembly Impact

Multi-function approach generated **5 instructions** with complex bitmask setup:

```asm
mov x8, #992
movk x8, #992, lsl #16
movk x8, #240, lsl #32
lsr x8, x8, x0
and w0, w8, #0x1
```

Range check generates **4 instructions** with simple arithmetic:

```asm
and w8, w0, #0xff
sub w8, w8, #5
cmp w8, #39
cset w0, lo
```

## Performance

- `is_any_keyword()` is called from `advance()` on **every single token**
- 20% fewer instructions (5 → 4)
- Simpler logic enables better branch prediction
- Eliminates complex constant loading

Added tests to verify enum layout assumptions remain valid.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
graphite-app Bot
pushed a commit
that referenced
this pull request
Oct 7, 2025
…14410)

## Summary

Replace multi-function calls and multiple enum variant checks with simple range checks, reducing assembly instructions in hot paths.

## Changes

### `is_any_keyword()`

**Before**: Called 4 separate functions checking 70+ enum variants:
- `is_reserved_keyword()` - 38 variants
- `is_contextual_keyword()` - 39 variants
- `is_strict_mode_contextual_keyword()` - 8 variants
- `is_future_reserved_keyword()` - 7 variants

**After**: Single range check `Await..=Yield` since all keywords are contiguous in the enum

### `is_number()`

**Before**: Matched 11 separate enum variants

**After**: Single range check `Decimal..=HexBigInt` since numeric literals are contiguous

## Assembly Analysis

### Before (with scattered checks)

```asm
mov x8, #992              ; Load bitmask constant
movk x8, #992, lsl #16    ; More bitmask setup
movk x8, #240, lsl #32    ; Even more bitmask setup
lsr x8, x8, x0            ; Shift by kind value
and w0, w8, #0x1          ; Extract result bit
```

**5 instructions** with complex constant loading

### After (with range check)

```asm
and w8, w0, #0xff         ; Extract byte
sub w8, w8, #5            ; Subtract range start
cmp w8, #39               ; Compare to range size
cset w0, lo               ; Set result
```

**4 instructions** with simple arithmetic

## Performance Impact

- **20% fewer instructions** (5 → 4)
- **Simpler logic** = better CPU pipeline utilization
- **No complex constants** = smaller code size
- **Better branch prediction** with single comparison

This is particularly important because:
- `is_any_keyword()` is called from `advance()` on **every single token**
- This is one of the hottest code paths in the entire parser

## Testing

Added unit tests to verify that:
- All keywords remain contiguous in the enum (`Await..=Yield`)
- All numeric literals remain contiguous (`Decimal..=HexBigInt`)

These tests will catch any future enum reordering that would break the optimization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
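The range-check idea can be sketched on a much smaller enum. This is an illustration only: the variant list below is a hypothetical subset chosen so that keywords and numeric literals are contiguous, and the real `Kind` enum has far more variants. With a `#[repr(u8)]` enum, "is this variant between `Await` and `Yield`" becomes a comparison on the discriminant, which is the shape the `sub` / `cmp` / `cset` sequence above reflects.

```rust
/// Hypothetical reduced token-kind enum. Declaration order matters:
/// keywords (Await..=Yield) and numbers (Decimal..=HexBigInt) are contiguous.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum Kind {
    Eof,
    Ident,
    // Keywords
    Await,
    Async,
    Let,
    Yield,
    // Numeric literals
    Decimal,
    Hex,
    HexBigInt,
    // Everything else
    Plus,
}

impl Kind {
    /// Single range check on the discriminant instead of matching each variant.
    fn is_any_keyword(self) -> bool {
        let k = self as u8;
        k >= Kind::Await as u8 && k <= Kind::Yield as u8
    }

    /// Same idea for numeric literals.
    fn is_number(self) -> bool {
        let k = self as u8;
        k >= Kind::Decimal as u8 && k <= Kind::HexBigInt as u8
    }
}
```

The check is only correct while the enum layout holds, which is why a unit test that enumerates every keyword and every non-keyword is a sensible guard against reordering.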
Boshen
added a commit
that referenced
this pull request
Jan 18, 2026
Cache `ptr` and `chunk_start` fields directly in `Bump` struct to eliminate pointer indirection through `ChunkFooter` in the allocation fast path.

Before (2 dependent loads):

```asm
ldr x9, [x0, #16]    ; Load footer ptr from Bump
ldr x8, [x9, #32]    ; Load ptr from footer (WAITS for x9!)
```

After (2 independent loads):

```asm
ldr x8, [x0]         ; Load ptr directly (offset 0)
ldr x9, [x0, #8]     ; Load chunk_start directly - PARALLEL!
```

This removes the data dependency between loads, allowing ARM to issue both loads in parallel via out-of-order execution.

Changes:
- Add `ptr` and `chunk_start` cached fields to `Bump` struct
- Add `#[repr(C)]` to ensure field ordering for optimal cache access
- Update `try_alloc_layout_fast` to use direct field access
- Sync cached fields on slow path (new chunk allocation) and iteration
- Update helper methods to use cached ptr

Size impact: `Bump` grows from 24 to 40 bytes - acceptable tradeoff for the hot path optimization.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
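A rough sketch of the cached-field layout, under simplifying assumptions: addresses are tracked as `usize` so no `unsafe` is needed, there is a single chunk, and the `ChunkFooter` here is a hypothetical stand-in that just owns the chunk's bytes. The real allocator differs in all of these respects; the point is only that the fast path reads two fields of `Bump` itself and never dereferences the footer.

```rust
use std::alloc::Layout;

/// Hypothetical stand-in for per-chunk metadata; here it only owns the bytes.
struct ChunkFooter {
    data: Box<[u8]>,
}

/// `#[repr(C)]` keeps the cached fields at offsets 0 and 8 on 64-bit targets.
#[repr(C)]
struct Bump {
    /// Cached bump position (an address); allocation moves it downward.
    ptr: usize,
    /// Cached lowest usable address of the current chunk.
    chunk_start: usize,
    /// Owning pointer to the chunk; only needed on slow paths.
    footer: Box<ChunkFooter>,
}

impl Bump {
    fn with_capacity(capacity: usize) -> Self {
        let footer = Box::new(ChunkFooter {
            data: vec![0u8; capacity].into_boxed_slice(),
        });
        let chunk_start = footer.data.as_ptr() as usize;
        Bump { ptr: chunk_start + capacity, chunk_start, footer }
    }

    /// Slow-path style accessor: this one does go through the footer.
    fn capacity(&self) -> usize {
        self.footer.data.len()
    }

    /// Fast path: touches only `self.ptr` and `self.chunk_start`.
    /// Returns the address of the allocation, or `None` if the chunk is full.
    fn try_alloc_layout_fast(&mut self, layout: Layout) -> Option<usize> {
        // Move down by the size, then round down to the required alignment.
        let new_ptr = self.ptr.checked_sub(layout.size())? & !(layout.align() - 1);
        if new_ptr < self.chunk_start {
            return None;
        }
        self.ptr = new_ptr;
        Some(new_ptr)
    }
}
```

In a multi-chunk allocator the cached fields would have to be refreshed whenever a new chunk becomes current, which corresponds to the "sync cached fields on slow path" item above.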
graphite-app Bot
pushed a commit
that referenced
this pull request
Jun 11, 2026
## Summary

- Change `Codegen::wrap` from `FnMut` to `FnOnce`, matching how the helper is used: every closure passed to `wrap` is invoked exactly once.
- Add an assembly comparison note showing the optimized codegen difference before and after the change.

## `Fn`, `FnMut`, and `FnOnce`?

`Fn`, `FnMut`, and `FnOnce` all describe how a closure may be called:

- `Fn` is the most restrictive for the closure body: it is callable through `&self`, can be called repeatedly, and cannot require mutable or consuming access to captured state.
- `FnMut` is callable through `&mut self`, can be called repeatedly, and may mutate captured state.
- `FnOnce` is callable by value, may consume captured state, and is only guaranteed to be callable once.

The trait relationship goes from most specific to most general call capability:

```rust
Fn: FnMut
FnMut: FnOnce
```

So accepting `FnOnce` is the least restrictive bound for a callback that is only invoked once. It still accepts `Fn` and `FnMut` closures, but it also tells the optimizer that `wrap` does not need a reusable mutable closure object.

## Assembly Impact

`Codegen::wrap` is an inline generic helper, so there is no stable standalone `wrap` assembly symbol in release output. The impact shows up in monomorphized call sites such as `Class::gen`, `Function::gen`, and expression `gen_expr` closures.

Before, several call sites materialized a closure environment on the stack before calling the closure:

```asm
strb w9, [sp, #15]
add x9, sp, #15
stp x0, x9, [sp, #16]
add x0, sp, #16
bl <...>::{{closure}}
```

After the `FnOnce` bound, the same shape can pass the one-shot closure state directly through registers:

```asm
and w1, w2, #0xfffffffd
mov x2, x19
bl <...>::{{closure}}
```

Some wrapper frames also shrink. For example, representative `Class::gen` / `Function::gen` paths go from a `64` byte frame to a `48` byte frame:

```asm
- sub sp, sp, #64
- stp x20, x19, [sp, #32]
- stp x29, x30, [sp, #48]
+ sub sp, sp, #48
+ stp x20, x19, [sp, #16]
+ stp x29, x30, [sp, #32]
```

Several restored-frame return paths also become tail calls:

```asm
ldp x29, x30, [sp, #32]
ldp x20, x19, [sp, #16]
add sp, sp, #48
b <closure or push_slow target>
```

The assembly diff also contains expected local label renumbering noise, such as switch-table suffixes changing from `.318` to `.330`; those are not behavior changes.
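To make the bound change concrete, here is a small sketch of a `wrap`-style helper taking `FnOnce`. The `Codegen` struct, its `out` field, and the `render` function are hypothetical and much simpler than the real code generator; the signature of `wrap` is an assumption modeled on the description above.

```rust
/// Hypothetical minimal code generator that writes into a string.
struct Codegen {
    out: String,
}

impl Codegen {
    /// Optionally surround whatever `f` prints with parentheses.
    /// `f` is called exactly once, so `FnOnce` is a sufficient bound.
    fn wrap<F: FnOnce(&mut Self)>(&mut self, wrap: bool, f: F) {
        if wrap {
            self.out.push('(');
        }
        f(self);
        if wrap {
            self.out.push(')');
        }
    }
}

/// Passes a closure that consumes its captured `String`.
fn render(name: String, wrap: bool) -> String {
    let mut codegen = Codegen { out: String::new() };
    codegen.wrap(wrap, move |p| {
        // Moving `name` out of the closure makes it `FnOnce` only;
        // this call would be rejected if `wrap` required `FnMut`.
        let owned: String = name;
        p.out.push_str(&owned);
    });
    codegen.out
}
```

Closures that only read or mutate their captures still satisfy the bound, so existing call sites need no changes.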
graphite-app Bot
pushed a commit
that referenced
this pull request
Jun 25, 2026
This avoids calling `Expression::span()` when wrapping an expression-bodied arrow function in a synthetic `ExpressionStatement`.

The parser already knows the expression body starts at the current token before parsing and ends at `prev_token_end` afterwards, so this can use `start_span()` / `end_span()` directly. That keeps the returned `Expression` in registers instead of materializing it on the stack just to call the generated `GetSpan` impl.

Before, the optimized parser assembly spilled the returned expression and called `Expression::span()`:

```asm
bl parse_assignment_expression_or_higher_impl
mov x20, x0
mov x22, x1
strh w21, [x19, #1196]
strb w0, [sp, #16]
str x1, [sp, #24]
add x0, sp, #16
bl GetSpan_for_Expression_span
...
str x0, [x21]          ; ExpressionStatement span
strb w20, [x21, #16]   ; Expression tag
str x22, [x21, #24]    ; Expression payload
```

After, the span is built from parser token state and the expression result is written directly into the arena allocation:

```asm
ldr x20, [x0, #816]    ; current token span before parse
...
bl parse_assignment_expression_or_higher_impl
strh w21, [x19, #1196]
ldr w8, [x19, #1192]   ; previous token end after parse
bfi x20, x8, #32, #32
...
str x20, [x21]         ; ExpressionStatement span
strb w0, [x21, #16]    ; Expression tag
str x1, [x21, #24]     ; Expression payload
```

This also reduced the first monomorphized `parse_arrow_function_expression_body` stack frame from 432 bytes to 416 bytes in my local release assembly.
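The source-level pattern can be sketched with a toy parser. Everything here is hypothetical and greatly simplified: the `Token`, `Expression`, and `Parser` types, the signatures of `start_span` / `end_span`, and the fact that an "expression" is just a fixed number of tokens. The point is that the statement span comes from token positions the parser already holds, not from inspecting the parsed expression.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    start: u32,
    end: u32,
}

#[derive(Clone, Copy)]
struct Token {
    start: u32,
    end: u32,
}

struct Expression {
    span: Span,
}

struct ExpressionStatement {
    span: Span,
    expression: Expression,
}

struct Parser {
    tokens: Vec<Token>,
    pos: usize,
    prev_token_end: u32,
}

impl Parser {
    fn new(tokens: Vec<Token>) -> Self {
        Parser { tokens, pos: 0, prev_token_end: 0 }
    }

    /// Start offset of the current (not yet consumed) token.
    fn start_span(&self) -> u32 {
        self.tokens[self.pos].start
    }

    /// Span from `start` to the end of the last consumed token.
    fn end_span(&self, start: u32) -> Span {
        Span { start, end: self.prev_token_end }
    }

    fn bump(&mut self) {
        self.prev_token_end = self.tokens[self.pos].end;
        self.pos += 1;
    }

    /// Toy expression parser: consumes `token_count` tokens.
    fn parse_expression(&mut self, token_count: usize) -> Expression {
        let start = self.start_span();
        for _ in 0..token_count {
            self.bump();
        }
        Expression { span: self.end_span(start) }
    }

    /// Wrap the expression body without reading `expression.span`.
    fn parse_arrow_expression_body(&mut self, token_count: usize) -> ExpressionStatement {
        let start = self.start_span();
        let expression = self.parse_expression(token_count);
        ExpressionStatement { span: self.end_span(start), expression }
    }
}
```

Both approaches yield the same span here; the difference is only in which data the wrapper has to touch after the expression is returned.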
camc314
added a commit
that referenced
this pull request
Jul 3, 2026
I had a gut feeling this would make it faster, but I don't know the precise reason.
Something along the lines of LLVM / branch prediction / cache friendliness ...