perf(lexer): utilize jump table for distinguishing tokens (~5% improvement) by Boshen · Pull Request #16 · oxc-project/oxc

Boshen · 2023-02-12T12:03:47Z

I had a gut feeling this would make it faster, but I don't know the precise reason.

Something along the lines of LLVM / branch prediction / cache friendliness ...

Boshen · 2023-02-12T12:11:14Z

The Ratel project goes to the extreme of putting callbacks inside the jump table. I won't do that here because the current approach is already fast enough. I may come back and revisit this later.

https://github.com/ratel-rust/ratel-core/blob/master/ratel/src/lexer/mod.rs
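For readers unfamiliar with the "callbacks inside the jump table" idea, here is a minimal sketch of it, not Ratel's or oxc's actual code. The lexer indexes a 256-entry table of function pointers with the current byte and calls whatever handler it finds. Every name here (`Kind`, `Handler`, `Lexer`, `build_table`, and the handler functions) is hypothetical, and the table is built at runtime for brevity, where a real lexer would more likely use a `static`.

```rust
/// Token kinds for this sketch only; a real lexer has many more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Ident,
    Number,
    Punct,
    Whitespace,
    Unknown,
}

/// A handler consumes one token starting at `lexer.pos` and reports its kind.
type Handler = fn(&mut Lexer) -> Kind;

struct Lexer {
    src: Vec<u8>,
    pos: usize,
    table: [Handler; 256],
}

/// Advance while `pred` holds for the current byte.
fn take_while(lexer: &mut Lexer, pred: fn(u8) -> bool) {
    while lexer.pos < lexer.src.len() && pred(lexer.src[lexer.pos]) {
        lexer.pos += 1;
    }
}

fn ident(lexer: &mut Lexer) -> Kind {
    lexer.pos += 1; // the first byte was already classified by the table
    take_while(lexer, |b: u8| b.is_ascii_alphanumeric() || b == b'_');
    Kind::Ident
}

fn number(lexer: &mut Lexer) -> Kind {
    take_while(lexer, |b: u8| b.is_ascii_digit());
    Kind::Number
}

fn whitespace(lexer: &mut Lexer) -> Kind {
    take_while(lexer, |b: u8| b == b' ' || b == b'\t' || b == b'\n');
    Kind::Whitespace
}

fn punct(lexer: &mut Lexer) -> Kind {
    lexer.pos += 1;
    Kind::Punct
}

fn unknown(lexer: &mut Lexer) -> Kind {
    lexer.pos += 1;
    Kind::Unknown
}

/// Build the byte -> handler table once; lookups afterwards are a single index.
fn build_table() -> [Handler; 256] {
    let mut table = [unknown as Handler; 256];
    for byte in 0..=255u8 {
        let handler: Handler = match byte {
            b'a'..=b'z' | b'A'..=b'Z' | b'_' => ident,
            b'0'..=b'9' => number,
            b' ' | b'\t' | b'\n' => whitespace,
            b'+' | b'-' | b'*' | b'/' | b'=' | b';' => punct,
            _ => unknown,
        };
        table[byte as usize] = handler;
    }
    table
}

impl Lexer {
    fn new(source: &str) -> Self {
        Lexer { src: source.as_bytes().to_vec(), pos: 0, table: build_table() }
    }

    /// Dispatch on the current byte through the table; `None` at end of input.
    fn next_token(&mut self) -> Option<Kind> {
        let byte = *self.src.get(self.pos)?;
        let handler = self.table[byte as usize];
        Some(handler(self))
    }
}
```

Calling `next_token` in a loop on `"foo + 42"` yields an identifier, whitespace, punctuation, whitespace, and a number. Whether an indirect call like this beats a compiler-generated `match` depends on the target and the input, which is what the benchmarks in this thread are measuring.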

github-actions · 2023-02-12T12:15:48Z

Parser Benchmark Results

group                    main                                   pr
-----                    ----                                   --
parser/babylon.max.js    1.00    138.1±0.27ms    74.8 MB/sec    1.00    138.5±0.32ms    74.6 MB/sec
parser/d3.js             1.00     16.3±0.03ms    33.4 MB/sec    1.01     16.5±0.07ms    33.2 MB/sec
parser/lodash.js         1.00      5.7±0.06ms    89.8 MB/sec    1.00      5.7±0.06ms    90.0 MB/sec
parser/pdf.js            1.00      9.4±0.01ms    42.8 MB/sec    1.02      9.6±0.03ms    42.1 MB/sec
parser/typescript.js     1.00    134.8±0.25ms    71.4 MB/sec    1.03    138.3±0.58ms    69.6 MB/sec

Boshen · 2023-02-12T12:20:08Z

Parser Benchmark Results

group                    main                                   pr
-----                    ----                                   --
parser/babylon.max.js    1.00    139.5±0.32ms    74.0 MB/sec    1.01    141.4±0.38ms    73.0 MB/sec
parser/d3.js             1.00     16.6±0.17ms    33.0 MB/sec    1.01     16.8±0.06ms    32.5 MB/sec
parser/lodash.js         1.00      5.8±0.07ms    89.1 MB/sec    1.01      5.8±0.09ms    88.5 MB/sec
parser/pdf.js            1.00      9.5±0.03ms    42.2 MB/sec    1.02      9.7±0.03ms    41.4 MB/sec
parser/typescript.js     1.00    136.6±0.29ms    70.4 MB/sec    1.03    140.0±0.23ms    68.7 MB/sec

😕


Boshen · 2023-02-12T13:13:31Z

Attempt failed. I guess the compiler is smarter than me.

Replace multi-function calls and multiple enum variant checks with simple range checks, reducing assembly instructions in hot paths.

## Changes

**`is_any_keyword()`**:
- Before: Called 4 separate functions (`is_reserved_keyword()`, `is_contextual_keyword()`, `is_strict_mode_contextual_keyword()`, `is_future_reserved_keyword()`) checking 70+ enum variants with early returns
- After: Single range check `Await..=Yield` since all keywords are contiguous in the enum

**`is_number()`**:
- Before: Matched 11 separate enum variants
- After: Single range check `Decimal..=HexBigInt` since all numeric literals are contiguous

## Assembly Impact

Multi-function approach generated **5 instructions** with complex bitmask setup:

```asm
mov x8, #992
movk x8, #992, lsl #16
movk x8, #240, lsl #32
lsr x8, x8, x0
and w0, w8, #0x1
```

Range check generates **4 instructions** with simple arithmetic:

```asm
and w8, w0, #0xff
sub w8, w8, #5
cmp w8, #39
cset w0, lo
```

## Performance

- `is_any_keyword()` is called from `advance()` on **every single token**
- 20% fewer instructions (5 → 4)
- Simpler logic enables better branch prediction
- Eliminates complex constant loading

Added tests to verify enum layout assumptions remain valid.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

…14410)

## Summary

Replace multi-function calls and multiple enum variant checks with simple range checks, reducing assembly instructions in hot paths.

## Changes

### `is_any_keyword()`

**Before**: Called 4 separate functions checking 70+ enum variants:
- `is_reserved_keyword()` - 38 variants
- `is_contextual_keyword()` - 39 variants
- `is_strict_mode_contextual_keyword()` - 8 variants
- `is_future_reserved_keyword()` - 7 variants

**After**: Single range check `Await..=Yield` since all keywords are contiguous in the enum

### `is_number()`

**Before**: Matched 11 separate enum variants

**After**: Single range check `Decimal..=HexBigInt` since numeric literals are contiguous

## Assembly Analysis

### Before (with scattered checks)

```asm
mov x8, #992 ; Load bitmask constant
movk x8, #992, lsl #16 ; More bitmask setup
movk x8, #240, lsl #32 ; Even more bitmask setup
lsr x8, x8, x0 ; Shift by kind value
and w0, w8, #0x1 ; Extract result bit
```

**5 instructions** with complex constant loading

### After (with range check)

```asm
and w8, w0, #0xff ; Extract byte
sub w8, w8, #5 ; Subtract range start
cmp w8, #39 ; Compare to range size
cset w0, lo ; Set result
```

**4 instructions** with simple arithmetic

## Performance Impact

- **20% fewer instructions** (5 → 4)
- **Simpler logic** = better CPU pipeline utilization
- **No complex constants** = smaller code size
- **Better branch prediction** with single comparison

This is particularly important because:
- `is_any_keyword()` is called from `advance()` on **every single token**
- This is one of the hottest code paths in the entire parser

## Testing

Added unit tests to verify that:
- All keywords remain contiguous in the enum (`Await..=Yield`)
- All numeric literals remain contiguous (`Decimal..=HexBigInt`)

These tests will catch any future enum reordering that would break the optimization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
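The range-check idea can be sketched on a toy enum. This is an illustration under assumptions, not the real `Kind` enum: the variant list is deliberately tiny, and `is_contiguous` is a hypothetical helper standing in for the layout tests the description mentions. Only the boundary names (`Await`, `Yield`, `Decimal`, `HexBigInt`) come from the text above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum Kind {
    Eof,
    Ident,
    // Keywords: must stay contiguous, from Await through Yield.
    Await,
    Break,
    Case,
    Yield,
    // Numeric literals: must stay contiguous, from Decimal through HexBigInt.
    Decimal,
    Hex,
    HexBigInt,
    Plus,
}

impl Kind {
    /// One range comparison on the discriminant instead of a per-variant match.
    fn is_any_keyword(self) -> bool {
        let (lo, hi) = (Kind::Await as u8, Kind::Yield as u8);
        (lo..=hi).contains(&(self as u8))
    }

    fn is_number(self) -> bool {
        let (lo, hi) = (Kind::Decimal as u8, Kind::HexBigInt as u8);
        (lo..=hi).contains(&(self as u8))
    }
}

/// Hypothetical layout check: true if `kinds` occupy consecutive
/// discriminants in the given order. A unit test calling this would fail
/// if someone reordered the enum and broke the range checks.
fn is_contiguous(kinds: &[Kind]) -> bool {
    kinds.windows(2).all(|w| w[0] as u8 + 1 == w[1] as u8)
}
```

The trade-off is that correctness now depends on declaration order, which is why a layout test belongs next to the optimization.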

Cache `ptr` and `chunk_start` fields directly in `Bump` struct to eliminate pointer indirection through `ChunkFooter` in the allocation fast path.

Before (2 dependent loads):

```asm
ldr x9, [x0, #16] ; Load footer ptr from Bump
ldr x8, [x9, #32] ; Load ptr from footer (WAITS for x9!)
```

After (2 independent loads):

```asm
ldr x8, [x0] ; Load ptr directly (offset 0)
ldr x9, [x0, #8] ; Load chunk_start directly - PARALLEL!
```

This removes the data dependency between loads, allowing ARM to issue both loads in parallel via out-of-order execution.

Changes:
- Add `ptr` and `chunk_start` cached fields to `Bump` struct
- Add `#[repr(C)]` to ensure field ordering for optimal cache access
- Update `try_alloc_layout_fast` to use direct field access
- Sync cached fields on slow path (new chunk allocation) and iteration
- Update helper methods to use cached ptr

Size impact: `Bump` grows from 24 to 40 bytes - acceptable tradeoff for the hot path optimization.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
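A rough sketch of the cached-field fast path, assuming a downward-bumping arena. `MiniBump` and `try_alloc_fast` are hypothetical stand-ins, not the real `Bump` or `try_alloc_layout_fast`. Addresses are modeled as plain `usize` values so the example stays safe and self-contained, and the slow path (allocating a new chunk and re-syncing the cached fields) is left out.

```rust
use std::cell::Cell;

/// Hypothetical, simplified arena header. `#[repr(C)]` pins the field order,
/// so `ptr` sits at offset 0 and `chunk_start` right after it.
#[repr(C)]
struct MiniBump {
    /// Cached bump pointer: the next allocation ends just below this address.
    ptr: Cell<usize>,
    /// Cached lower bound of the current chunk.
    chunk_start: Cell<usize>,
}

impl MiniBump {
    fn new(chunk_start: usize, chunk_end: usize) -> Self {
        MiniBump { ptr: Cell::new(chunk_end), chunk_start: Cell::new(chunk_start) }
    }

    /// Fast path: both values are read straight from `self`, with no hop
    /// through a separate footer structure. Returns `None` when the chunk
    /// is exhausted, where a real allocator would take its slow path.
    fn try_alloc_fast(&self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        let ptr = self.ptr.get();
        let start = self.chunk_start.get();
        // Bump downward, then round down to the requested alignment.
        let new_ptr = ptr.checked_sub(size)? & !(align - 1);
        if new_ptr < start {
            return None;
        }
        self.ptr.set(new_ptr);
        Some(new_ptr)
    }
}
```

Because the two reads no longer depend on each other, the CPU is free to issue them together. The cost is a larger struct and the need to keep the cached copies in sync whenever the current chunk changes.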

## Summary

- Change `Codegen::wrap` from `FnMut` to `FnOnce`, matching how the helper is used: every closure passed to `wrap` is invoked exactly once.
- Add an assembly comparison note showing the optimized codegen difference before and after the change.

## `Fn`, `FnMut`, and `FnOnce`?

`Fn`, `FnMut`, and `FnOnce` all describe how a closure may be called:

- `Fn` is the most restrictive for the closure body: it is callable through `&self`, can be called repeatedly, and cannot require mutable or consuming access to captured state.
- `FnMut` is callable through `&mut self`, can be called repeatedly, and may mutate captured state.
- `FnOnce` is callable by value, may consume captured state, and is only guaranteed to be callable once.

The trait relationship goes from most specific to most general call capability:

```rust
Fn: FnMut
FnMut: FnOnce
```

So accepting `FnOnce` is the least restrictive bound for a callback that is only invoked once. It still accepts `Fn` and `FnMut` closures, but it also tells the optimizer that `wrap` does not need a reusable mutable closure object.

## Assembly Impact

`Codegen::wrap` is an inline generic helper, so there is no stable standalone `wrap` assembly symbol in release output. The impact shows up in monomorphized call sites such as `Class::gen`, `Function::gen`, and expression `gen_expr` closures.

Before, several call sites materialized a closure environment on the stack before calling the closure:

```asm
strb w9, [sp, #15]
add x9, sp, #15
stp x0, x9, [sp, #16]
add x0, sp, #16
bl <...>::{{closure}}
```

After the `FnOnce` bound, the same shape can pass the one-shot closure state directly through registers:

```asm
and w1, w2, #0xfffffffd
mov x2, x19
bl <...>::{{closure}}
```

Some wrapper frames also shrink. For example, representative `Class::gen` / `Function::gen` paths go from a `64` byte frame to a `48` byte frame:

```asm
- sub sp, sp, #64
- stp x20, x19, [sp, #32]
- stp x29, x30, [sp, #48]
+ sub sp, sp, #48
+ stp x20, x19, [sp, #16]
+ stp x29, x30, [sp, #32]
```

Several restored-frame return paths also become tail calls:

```asm
ldp x29, x30, [sp, #32]
ldp x20, x19, [sp, #16]
add sp, sp, #48
b <closure or push_slow target>
```

The assembly diff also contains expected local label renumbering noise, such as switch-table suffixes changing from `.318` to `.330`; those are not behavior changes.
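To make the bound change concrete, here is a small sketch of a `wrap`-style helper that takes `FnOnce`. `Printer`, its `wrap` signature, and `print_joined` are assumptions made for illustration and are not the real `Codegen` API. The closure passed in consumes a captured `Vec`, which an `FnMut` bound would reject.

```rust
/// Hypothetical stand-in for a code generator that writes into a buffer.
struct Printer {
    out: String,
}

impl Printer {
    /// Optionally surround the callback's output with parentheses.
    /// The callback runs exactly once, so `FnOnce` is sufficient.
    fn wrap<F: FnOnce(&mut Self)>(&mut self, wrap: bool, f: F) {
        if wrap {
            self.out.push('(');
        }
        f(self);
        if wrap {
            self.out.push(')');
        }
    }
}

/// The closure iterates `parts` by value, consuming the captured vector.
/// That makes it `FnOnce` only; it would not satisfy an `FnMut` bound.
fn print_joined(parts: Vec<String>) -> String {
    let mut printer = Printer { out: String::new() };
    printer.wrap(true, move |p| {
        for part in parts {
            p.out.push_str(&part);
        }
    });
    printer.out
}
```

Closures that only read or mutate their captures still compile against this signature, because every `Fn` and `FnMut` closure also implements `FnOnce`.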

This avoids calling `Expression::span()` when wrapping an expression-bodied arrow function in a synthetic `ExpressionStatement`.

The parser already knows the expression body starts at the current token before parsing and ends at `prev_token_end` afterwards, so this can use `start_span()` / `end_span()` directly. That keeps the returned `Expression` in registers instead of materializing it on the stack just to call the generated `GetSpan` impl.

Before, the optimized parser assembly spilled the returned expression and called `Expression::span()`:

```asm
bl parse_assignment_expression_or_higher_impl
mov x20, x0
mov x22, x1
strh w21, [x19, #1196]
strb w0, [sp, #16]
str x1, [sp, #24]
add x0, sp, #16
bl GetSpan_for_Expression_span
...
str x0, [x21] ; ExpressionStatement span
strb w20, [x21, #16] ; Expression tag
str x22, [x21, #24] ; Expression payload
```

After, the span is built from parser token state and the expression result is written directly into the arena allocation:

```asm
ldr x20, [x0, #816] ; current token span before parse
...
bl parse_assignment_expression_or_higher_impl
strh w21, [x19, #1196]
ldr w8, [x19, #1192] ; previous token end after parse
bfi x20, x8, #32, #32
...
str x20, [x21] ; ExpressionStatement span
strb w0, [x21, #16] ; Expression tag
str x1, [x21, #24] ; Expression payload
```

This also reduced the first monomorphized `parse_arrow_function_expression_body` stack frame from 432 bytes to 416 bytes in my local release assembly.
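The bookkeeping behind this can be sketched with a toy parser: record the current token's start before parsing, then build the span from the previous token's end afterwards, without asking the parsed node for its span. `MiniParser`, `Token`, `bump`, and `parse_expression_body` are hypothetical, and the `start_span` / `end_span` signatures shown are assumptions rather than the real parser's API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    start: u32,
    end: u32,
}

struct Token {
    start: u32,
    end: u32,
}

/// Hypothetical parser state: a token list, a cursor, and the end offset
/// of the most recently consumed token.
struct MiniParser {
    tokens: Vec<Token>,
    index: usize,
    prev_token_end: u32,
}

impl MiniParser {
    /// Start offset of the token the parser is currently looking at.
    fn start_span(&self) -> u32 {
        self.tokens[self.index].start
    }

    /// Close a span at the end of the last consumed token.
    fn end_span(&self, start: u32) -> Span {
        Span { start, end: self.prev_token_end }
    }

    /// Consume one token and remember where it ended.
    fn bump(&mut self) {
        self.prev_token_end = self.tokens[self.index].end;
        self.index += 1;
    }

    /// Stand-in for parsing an expression body made of `count` tokens.
    /// The span comes purely from token positions, not from the parsed node.
    fn parse_expression_body(&mut self, count: usize) -> Span {
        let start = self.start_span();
        for _ in 0..count {
            self.bump();
        }
        self.end_span(start)
    }
}
```

This only gives the right answer when the node begins at the current token and ends at the last consumed one, which is the situation the description relies on for expression-bodied arrow functions.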


Boshen force-pushed the lexer-jump-table branch from fece96f to 0427bdd Compare February 12, 2023 12:05


Boshen force-pushed the lexer-jump-table branch 2 times, most recently from 728326a to 42ea38e Compare February 12, 2023 12:46

perf(lexer): utilize jump table for distinguishing tokens (~5% improvement) 5fa83fd

Boshen force-pushed the lexer-jump-table branch from 42ea38e to 5fa83fd Compare February 12, 2023 13:01

Boshen closed this Feb 12, 2023

Boshen wants to merge 1 commit into main from lexer-jump-table
