perf(lexer): utilize jump table for distinguishing tokens (~5% improvement) #16

Closed
Boshen wants to merge 1 commit into main from lexer-jump-table

Conversation

@Boshen

@Boshen Boshen commented Feb 12, 2023

Member

I had a gut feeling this would make it faster, but I don't know the precise reason.

Something along the lines of LLVM / branch prediction / cache friendliness ...

Comment thread crates/oxc_parser/src/lexer/mod.rs
@Boshen

Boshen commented Feb 12, 2023

Member Author

The Ratel project goes to the extreme of putting callbacks inside the jump table. I won't do that here because the current approach is already fast enough. I may come back and revisit this later.

https://github.com/ratel-rust/ratel-core/blob/master/ratel/src/lexer/mod.rs
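For readers unfamiliar with the "callbacks inside the jump table" idea, here is a minimal sketch of it. It is not Ratel's or this PR's actual code: the `Kind` variants, the `Handler` signature, and the `build_table` / `tokenize` helpers are all hypothetical, chosen only to show a 256-entry table indexed by the first byte of a token.

```rust
// Hypothetical token kinds for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Kind {
    Ident,
    Number,
    Plus,
    Unknown,
}

// Each handler receives the remaining source and returns the token kind
// plus the number of bytes it consumed (assumed signature).
type Handler = fn(&[u8]) -> (Kind, usize);

fn ident(src: &[u8]) -> (Kind, usize) {
    let n = src
        .iter()
        .take_while(|b| b.is_ascii_alphanumeric() || **b == b'_')
        .count();
    (Kind::Ident, n)
}

fn number(src: &[u8]) -> (Kind, usize) {
    let n = src.iter().take_while(|b| b.is_ascii_digit()).count();
    (Kind::Number, n)
}

fn plus(_: &[u8]) -> (Kind, usize) {
    (Kind::Plus, 1)
}

fn unknown(_: &[u8]) -> (Kind, usize) {
    (Kind::Unknown, 1)
}

// Build the table once: one handler per possible first byte.
fn build_table() -> [Handler; 256] {
    let mut table = [unknown as Handler; 256];
    for b in 0..=255u8 {
        table[b as usize] = match b {
            b'a'..=b'z' | b'A'..=b'Z' | b'_' => ident as Handler,
            b'0'..=b'9' => number as Handler,
            b'+' => plus as Handler,
            _ => unknown as Handler,
        };
    }
    table
}

// Dispatch on the first byte of each token through the table.
fn tokenize(src: &[u8]) -> Vec<(Kind, usize)> {
    let table = build_table();
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < src.len() {
        let (kind, len) = table[src[pos] as usize](&src[pos..]);
        out.push((kind, len));
        pos += len;
    }
    out
}
```

Whether an indirect call through such a table beats a plain `match` depends on what the compiler already generates for the `match`, which is the question the benchmarks in this thread try to answer.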

@github-actions

github-actions Bot commented Feb 12, 2023

Contributor

Parser Benchmark Results

group                    main                                   pr
-----                    ----                                   --
parser/babylon.max.js    1.00    138.1±0.27ms    74.8 MB/sec    1.00    138.5±0.32ms    74.6 MB/sec
parser/d3.js             1.00     16.3±0.03ms    33.4 MB/sec    1.01     16.5±0.07ms    33.2 MB/sec
parser/lodash.js         1.00      5.7±0.06ms    89.8 MB/sec    1.00      5.7±0.06ms    90.0 MB/sec
parser/pdf.js            1.00      9.4±0.01ms    42.8 MB/sec    1.02      9.6±0.03ms    42.1 MB/sec
parser/typescript.js     1.00    134.8±0.25ms    71.4 MB/sec    1.03    138.3±0.58ms    69.6 MB/sec

@Boshen

Boshen commented Feb 12, 2023

Member Author

Parser Benchmark Results

group                    main                                   pr
-----                    ----                                   --
parser/babylon.max.js    1.00    139.5±0.32ms    74.0 MB/sec    1.01    141.4±0.38ms    73.0 MB/sec
parser/d3.js             1.00     16.6±0.17ms    33.0 MB/sec    1.01     16.8±0.06ms    32.5 MB/sec
parser/lodash.js         1.00      5.8±0.07ms    89.1 MB/sec    1.01      5.8±0.09ms    88.5 MB/sec
parser/pdf.js            1.00      9.5±0.03ms    42.2 MB/sec    1.02      9.7±0.03ms    41.4 MB/sec
parser/typescript.js     1.00    136.6±0.29ms    70.4 MB/sec    1.03    140.0±0.23ms    68.7 MB/sec

😕

Boshen force-pushed the lexer-jump-table branch 2 times, most recently from 728326a to 42ea38e, February 12, 2023 12:46
@Boshen

Boshen commented Feb 12, 2023

Member Author

Attempt failed; I guess the compiler is smarter than me.

@Boshen Boshen closed this Feb 12, 2023
Boshen added a commit that referenced this pull request Oct 7, 2025
Replace multi-function calls and multiple enum variant checks with simple range checks, reducing assembly instructions in hot paths.

## Changes

**`is_any_keyword()`**:
- Before: Called 4 separate functions (`is_reserved_keyword()`, `is_contextual_keyword()`, `is_strict_mode_contextual_keyword()`, `is_future_reserved_keyword()`) checking 70+ enum variants with early returns
- After: Single range check `Await..=Yield` since all keywords are contiguous in the enum

**`is_number()`**:
- Before: Matched 11 separate enum variants
- After: Single range check `Decimal..=HexBigInt` since all numeric literals are contiguous

## Assembly Impact

Multi-function approach generated **5 instructions** with complex bitmask setup:
```asm
mov   x8, #992
movk  x8, #992, lsl #16
movk  x8, #240, lsl #32
lsr   x8, x8, x0
and   w0, w8, #0x1
```

Range check generates **4 instructions** with simple arithmetic:
```asm
and   w8, w0, #0xff
sub   w8, w8, #5
cmp   w8, #39
cset  w0, lo
```

## Performance

- `is_any_keyword()` is called from `advance()` on **every single token**
- 20% fewer instructions (5 → 4)
- Simpler logic enables better branch prediction
- Eliminates complex constant loading

Added tests to verify enum layout assumptions remain valid.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
graphite-app Bot pushed a commit that referenced this pull request Oct 7, 2025
…14410)

## Summary

Replace multi-function calls and multiple enum variant checks with simple range checks, reducing assembly instructions in hot paths.

## Changes

### `is_any_keyword()`
**Before**: Called 4 separate functions checking 70+ enum variants:
- `is_reserved_keyword()` - 38 variants
- `is_contextual_keyword()` - 39 variants
- `is_strict_mode_contextual_keyword()` - 8 variants
- `is_future_reserved_keyword()` - 7 variants

**After**: Single range check `Await..=Yield` since all keywords are contiguous in the enum

### `is_number()`
**Before**: Matched 11 separate enum variants
**After**: Single range check `Decimal..=HexBigInt` since numeric literals are contiguous

## Assembly Analysis

### Before (with scattered checks)
```asm
mov   x8, #992              ; Load bitmask constant
movk  x8, #992, lsl #16     ; More bitmask setup
movk  x8, #240, lsl #32     ; Even more bitmask setup
lsr   x8, x8, x0            ; Shift by kind value
and   w0, w8, #0x1          ; Extract result bit
```
**5 instructions** with complex constant loading

### After (with range check)
```asm
and   w8, w0, #0xff         ; Extract byte
sub   w8, w8, #5            ; Subtract range start
cmp   w8, #39               ; Compare to range size
cset  w0, lo                ; Set result
```
**4 instructions** with simple arithmetic
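As a reading aid, the two listings above can be transcribed into Rust. This is a literal translation of the instructions as shown, not the project's source: the function names are made up, and the constants are taken directly from the immediates in the assembly.

```rust
// Constant assembled by `mov` / `movk`: 992 | 992 << 16 | 240 << 32.
const MASK: u64 = 992 | (992 << 16) | (240 << 32);

// "Before": shift the mask right by the discriminant and keep bit 0.
// `wrapping_shr` reduces the shift amount modulo 64, which mirrors how
// `lsr` with a register operand behaves on a 64-bit register.
fn bitmask_test(kind: u8) -> bool {
    (MASK.wrapping_shr(kind as u32) & 1) == 1
}

// "After": subtract the range start, then one unsigned comparison.
// Values below 5 wrap around to large numbers and fail the `< 39` check,
// so this single compare covers both ends of the range 5..=43.
fn range_test(kind: u8) -> bool {
    kind.wrapping_sub(5) < 39
}
```

Read this way, the mask marks discriminants 5 to 9, 21 to 25 and 36 to 39, while the range check accepts 5 to 43, so the two listings may not come from the same predicate. That is an observation from the constants only; the commit message does not say which function each listing belongs to.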

## Performance Impact

- **20% fewer instructions** (5 → 4)
- **Simpler logic** = better CPU pipeline utilization
- **No complex constants** = smaller code size
- **Better branch prediction** with single comparison

This is particularly important because:
- `is_any_keyword()` is called from `advance()` on **every single token**
- This is one of the hottest code paths in the entire parser

## Testing

Added unit tests to verify that:
- All keywords remain contiguous in the enum (`Await..=Yield`)
- All numeric literals remain contiguous (`Decimal..=HexBigInt`)

These tests will catch any future enum reordering that would break the optimization.
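A minimal sketch of the range check and the kind of layout test described above, assuming a much smaller enum than the real `Kind`. The variant list, the `ALL` array, and the `*_slow` reference predicates are hypothetical; only the `Await..=Yield` and `Decimal..=HexBigInt` boundaries come from the commit message.

```rust
// Hypothetical, heavily shortened stand-in for the lexer's `Kind` enum.
// Keywords and numeric literals are declared contiguously on purpose.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
enum Kind {
    Eof,
    Ident,
    Await, // first keyword
    Break,
    Class,
    Yield, // last keyword
    Decimal, // first numeric literal
    Hex,
    HexBigInt, // last numeric literal
    Plus,
}

const ALL: [Kind; 10] = [
    Kind::Eof, Kind::Ident, Kind::Await, Kind::Break, Kind::Class,
    Kind::Yield, Kind::Decimal, Kind::Hex, Kind::HexBigInt, Kind::Plus,
];

impl Kind {
    // Range checks: valid only while the variants stay contiguous.
    fn is_any_keyword(self) -> bool {
        let k = self as u8;
        k >= Kind::Await as u8 && k <= Kind::Yield as u8
    }

    fn is_number(self) -> bool {
        let k = self as u8;
        k >= Kind::Decimal as u8 && k <= Kind::HexBigInt as u8
    }

    // Reference predicates that list every variant explicitly.
    fn is_any_keyword_slow(self) -> bool {
        matches!(self, Kind::Await | Kind::Break | Kind::Class | Kind::Yield)
    }

    fn is_number_slow(self) -> bool {
        matches!(self, Kind::Decimal | Kind::Hex | Kind::HexBigInt)
    }
}

// The layout check: the range form must agree with the explicit form
// for every variant, so inserting a non-keyword between `Await` and
// `Yield` makes this return false.
fn layout_assumptions_hold() -> bool {
    ALL.iter().all(|&k| {
        k.is_any_keyword() == k.is_any_keyword_slow()
            && k.is_number() == k.is_number_slow()
    })
}
```

In a real crate the final function would more likely live in a `#[test]`; it is a plain function here so the sketch stays self-contained.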

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Boshen added a commit that referenced this pull request Jan 18, 2026
Cache `ptr` and `chunk_start` fields directly in `Bump` struct to eliminate
pointer indirection through `ChunkFooter` in the allocation fast path.

Before (2 dependent loads):
```asm
ldr x9, [x0, #16]        ; Load footer ptr from Bump
ldr x8, [x9, #32]        ; Load ptr from footer (WAITS for x9!)
```

After (2 independent loads):
```asm
ldr x8, [x0]             ; Load ptr directly (offset 0)
ldr x9, [x0, #8]         ; Load chunk_start directly - PARALLEL!
```

This removes the data dependency between loads, allowing ARM to issue
both loads in parallel via out-of-order execution.

Changes:
- Add `ptr` and `chunk_start` cached fields to `Bump` struct
- Add `#[repr(C)]` to ensure field ordering for optimal cache access
- Update `try_alloc_layout_fast` to use direct field access
- Sync cached fields on slow path (new chunk allocation) and iteration
- Update helper methods to use cached ptr

Size impact: `Bump` grows from 24 to 40 bytes - acceptable tradeoff
for the hot path optimization.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
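The caching idea in this commit message can be sketched with a toy downward-bumping allocator. This is a simplified model, not the real `Bump`: the `chunk` field stands in for the chunk and footer machinery, addresses are handled as `usize`, and there is no slow path. Only the `ptr` and `chunk_start` field names, `#[repr(C)]`, and `try_alloc_layout_fast` are taken from the message.

```rust
use std::alloc::Layout;

// Toy model: the two values the fast path needs sit at the front of the
// struct, so both can be loaded directly from `self`.
#[repr(C)]
struct Bump {
    ptr: usize,         // cached bump pointer, moves downward
    chunk_start: usize, // cached lowest usable address of the chunk
    #[allow(dead_code)]
    chunk: Box<[u8]>,   // hypothetical backing storage, keeps memory alive
}

impl Bump {
    fn with_capacity(cap: usize) -> Self {
        let chunk = vec![0u8; cap].into_boxed_slice();
        let start = chunk.as_ptr() as usize;
        Bump { ptr: start + cap, chunk_start: start, chunk }
    }

    // Fast path: subtract the size, round down to the alignment, and
    // check against the cached chunk start. Returns the address only.
    fn try_alloc_layout_fast(&mut self, layout: Layout) -> Option<usize> {
        let new_ptr = self.ptr.checked_sub(layout.size())? & !(layout.align() - 1);
        if new_ptr < self.chunk_start {
            return None; // the real allocator would take the slow path here
        }
        self.ptr = new_ptr;
        Some(new_ptr)
    }
}
```

Because neither field is read through another pointer, the two loads have no dependency on each other, which is the property the before/after assembly illustrates.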
graphite-app Bot pushed a commit that referenced this pull request Jun 11, 2026
## Summary

- Change `Codegen::wrap` from `FnMut` to `FnOnce`, matching how the helper is used: every closure passed to `wrap` is invoked exactly once.
- Add an assembly comparison note showing the optimized codegen difference before and after the change.

## `Fn`, `FnMut`, and `FnOnce`?

`Fn`, `FnMut`, and `FnOnce` all describe how a closure may be called:

- `Fn` is the most restrictive for the closure body: it is callable through `&self`, can be called repeatedly, and cannot require mutable or consuming access to captured state.
- `FnMut` is callable through `&mut self`, can be called repeatedly, and may mutate captured state.
- `FnOnce` is callable by value, may consume captured state, and is only guaranteed to be callable once.

The trait relationship goes from most specific to most general call capability:

```rust
Fn: FnMut
FnMut: FnOnce
```

So accepting `FnOnce` is the least restrictive bound for a callback that is only invoked once. It still accepts `Fn` and `FnMut` closures, but it also tells the optimizer that `wrap` does not need a reusable mutable closure object.
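To make the bound concrete, here is a small sketch of a `wrap`-style helper that takes `FnOnce`. `Printer` is a hypothetical stand-in for `Codegen`, and the exact signature of the real `wrap` is assumed; the point is only that all three kinds of closure are accepted.

```rust
// Hypothetical stand-in for the code generator.
struct Printer {
    out: String,
}

impl Printer {
    // The callback runs exactly once, so `FnOnce` is a sufficient bound.
    fn wrap<F: FnOnce(&mut Self)>(&mut self, wrap: bool, f: F) {
        if wrap {
            self.out.push('(');
        }
        f(self);
        if wrap {
            self.out.push(')');
        }
    }
}

fn demo() -> (String, u32) {
    let mut p = Printer { out: String::new() };

    // A closure that captures nothing implements `Fn`.
    p.wrap(true, |p| p.out.push('a'));

    // A closure that mutates captured state implements `FnMut`.
    let mut count = 0;
    p.wrap(false, |p| {
        count += 1;
        p.out.push('b');
    });

    // A closure that moves a captured value out of itself is `FnOnce` only;
    // an `FnMut` bound on `wrap` would reject it.
    let owned = String::from("c");
    p.wrap(true, move |p| {
        let s = owned;
        p.out.push_str(&s);
    });

    (p.out, count)
}
```

Calling `demo()` exercises all three closures through the same helper.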

## Assembly Impact

`Codegen::wrap` is an inline generic helper, so there is no stable standalone `wrap` assembly symbol in release output. The impact shows up in monomorphized call sites such as `Class::gen`, `Function::gen`, and expression `gen_expr` closures.

Before, several call sites materialized a closure environment on the stack before calling the closure:

```asm
strb    w9, [sp, #15]
add     x9, sp, #15
stp     x0, x9, [sp, #16]
add     x0, sp, #16
bl      <...>::{{closure}}
```

After the `FnOnce` bound, the same shape can pass the one-shot closure state directly through registers:

```asm
and     w1, w2, #0xfffffffd
mov     x2, x19
bl      <...>::{{closure}}
```

Some wrapper frames also shrink. For example, representative `Class::gen` / `Function::gen` paths go from a `64` byte frame to a `48` byte frame:

```asm
- sub     sp, sp, #64
- stp     x20, x19, [sp, #32]
- stp     x29, x30, [sp, #48]
+ sub     sp, sp, #48
+ stp     x20, x19, [sp, #16]
+ stp     x29, x30, [sp, #32]
```

Several restored-frame return paths also become tail calls:

```asm
ldp     x29, x30, [sp, #32]
ldp     x20, x19, [sp, #16]
add     sp, sp, #48
b       <closure or push_slow target>
```

The assembly diff also contains expected local label renumbering noise, such as switch-table suffixes changing from `.318` to `.330`; those are not behavior changes.
graphite-app Bot pushed a commit that referenced this pull request Jun 25, 2026
This avoids calling `Expression::span()` when wrapping an expression-bodied arrow function in a synthetic `ExpressionStatement`.

The parser already knows the expression body starts at the current token before parsing and ends at `prev_token_end` afterwards, so this can use `start_span()` / `end_span()` directly. That keeps the returned `Expression` in registers instead of materializing it on the stack just to call the generated `GetSpan` impl.

Before, the optimized parser assembly spilled the returned expression and called `Expression::span()`:

```asm
bl      parse_assignment_expression_or_higher_impl
mov     x20, x0
mov     x22, x1
strh    w21, [x19, #1196]
strb    w0, [sp, #16]
str     x1, [sp, #24]
add     x0, sp, #16
bl      GetSpan_for_Expression_span
...
str     x0, [x21]       ; ExpressionStatement span
strb    w20, [x21, #16] ; Expression tag
str     x22, [x21, #24] ; Expression payload
```

After, the span is built from parser token state and the expression result is written directly into the arena allocation:

```asm
ldr     x20, [x0, #816]  ; current token span before parse
...
bl      parse_assignment_expression_or_higher_impl
strh    w21, [x19, #1196]
ldr     w8, [x19, #1192] ; previous token end after parse
bfi     x20, x8, #32, #32
...
str     x20, [x21]      ; ExpressionStatement span
strb    w0, [x21, #16]  ; Expression tag
str     x1, [x21, #24]  ; Expression payload
```

This also reduced the first monomorphized `parse_arrow_function_expression_body` stack frame from 432 bytes to 416 bytes in my local release assembly.
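The change described above can be sketched as follows. The `Parser`, `Expr`, and `parse_expr` definitions are hypothetical simplifications, and the `start_span()` / `end_span()` signatures are assumed rather than copied from the parser; the sketch only shows the statement span being built from token positions without asking the expression for its span.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Span {
    start: u32,
    end: u32,
}

// Hypothetical expression node that records its own span.
struct Expr {
    span: Span,
}

// Hypothetical parser state: where the current token starts and where
// the previously consumed token ended.
struct Parser {
    cur_token_start: u32,
    prev_token_end: u32,
}

impl Parser {
    fn start_span(&self) -> u32 {
        self.cur_token_start
    }

    fn end_span(&self, start: u32) -> Span {
        Span { start, end: self.prev_token_end }
    }

    // Pretend to parse an expression that is `len` bytes long.
    fn parse_expr(&mut self, len: u32) -> Expr {
        let start = self.cur_token_start;
        let end = start + len;
        self.prev_token_end = end;
        self.cur_token_start = end;
        Expr { span: Span { start, end } }
    }

    // Build the wrapping statement's span from token state alone, so the
    // returned expression never has to be inspected.
    fn parse_arrow_body(&mut self, len: u32) -> (Span, Expr) {
        let start = self.start_span();
        let expr = self.parse_expr(len);
        let stmt_span = self.end_span(start);
        (stmt_span, expr)
    }
}
```

In this model the span from token state and the expression's own span are identical, which is the precondition for skipping the `span()` call.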