Skip to content

perf(parser): lex identifiers as bytes not chars#2352

Merged
Boshen merged 1 commit intomainfrom
02-08-perf_parser_lex_identifiers_as_bytes_not_chars
Feb 9, 2024
Merged

perf(parser): lex identifiers as bytes not chars#2352
Boshen merged 1 commit intomainfrom
02-08-perf_parser_lex_identifiers_as_bytes_not_chars

Conversation

@overlookmotel
Copy link
Copy Markdown
Member

@overlookmotel overlookmotel commented Feb 8, 2024

This PR re-implements lexing identifiers with a fast path for the most common case - identifiers which are pure ASCII characters, using the new Source / SourcePosition APIs.

Lexing identifiers is a hot path, and accounts for the majority of the time the Lexer spends. The performance bump from this change is (if I do say so myself!) quite decent.

I've spent a lot of time tuning the implementation, which gained a further 10-15% on the Lexer benchmarks compared to my first, simpler attempt. Some of the design decisions, if they look odd, are likely motivated by gains in performance.

Techniques

This implementation uses a few different strategies for performance:

  • Search byte-by-byte, not char-by-char.
  • Process batches of 32 bytes at a time to reduce bounds checks.
  • Mark uncommon paths #[cold].

Structure

The implementation is built in 3 layers:

  1. ASCII characters only.
  2. ASCII and Unicode characters.
  3. \ escape sequences (and all the above).

identifier_name_handler starts at the top layer, and is optimized for consuming ASCII as fast as possible. Each "layer" is considered more uncommon than the previous, and dropping down a layer is a de-opt.

I'm assuming that 95%+ of JavaScript code does not include either Unicode characters or escapes in identifiers, so the speed of the fast path is prioritised.

That said, once a Unicode character is encountered, the next layer does expect to find further Unicode characters, rather than de-opting over and over again. If an identifier starts with a Unicode character, it enters the code straight on the 2nd layer, so is not penalised by going through a #[cold] boundary. Lexing Unicode is never going to be as fast as ASCII, but still I felt it was important not to penalise it unnecessarily, so as not to be Anglo-centric.

ASCII search macro

The main ASCII search is implemented as a macro. I found that, for reasons I don't understand, it's significantly faster to have all the code in a single function, even compared to multiple functions marked #[inline] or #[inline(always)]. The fastest implementation also requires some code to be repeated twice, which is nicer to do with a macro.

This macro, and the ByteMatchTable types that go with it, are designed to be re-usable. Next step will be to apply them for whitespace and strings, which should be fairly simple.

Searching in batches of 32 bytes is also designed to be forward-compatible with SIMD.

Bye bye AutoCow

AutoCow is removed. Instead, a string-builder is only created if it's needed, when a \ escape is first encountered. The string builder is also more efficient than AutoCow was, as it copies bytes in chunks, rather than 1-by-1.

This won't make much difference for identifiers, as escapes are so rare anyway, but this same technique can be used for strings, where they're more common.

@overlookmotel
Copy link
Copy Markdown
Member Author

overlookmotel commented Feb 8, 2024

@github-actions github-actions bot added the A-parser Area - Parser label Feb 8, 2024
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq bot commented Feb 8, 2024

CodSpeed Performance Report

Merging #2352 will improve performances by 56.76%

Comparing 02-08-perf_parser_lex_identifiers_as_bytes_not_chars (9a8d7b7) with main (6910e4f)

Summary

⚡ 17 improvements
✅ 10 untouched benchmarks

Benchmarks breakdown

Benchmark main lex_identifiers_as_bytes_not_chars Change
transformer[checker.ts] 1.1 s 1.1 s +4.63%
lexer[checker.ts] 146.1 ms 95.6 ms +52.76%
lexer[pdf.mjs] 35.3 ms 22.5 ms +56.76%
parser[cal.com.tsx] 176.4 ms 159.7 ms +10.45%
lexer[cal.com.tsx] 59.7 ms 43 ms +38.74%
transformer[cal.com.tsx] 520.7 ms 504 ms +3.31%
lexer[antd.js] 230 ms 157.7 ms +45.81%
minifier[react.development.js] 12.2 ms 11.2 ms +9.14%
transformer[antd.js] 1.7 s 1.6 s +4.5%
parser[RadixUIAdoptionSection.jsx] 421.9 µs 397.2 µs +6.22%
minifier[typescript.js] 1.8 s 1.6 s +9.65%
transformer[pdf.mjs] 300.9 ms 288.2 ms +4.41%
transformer[RadixUIAdoptionSection.jsx] 819.2 µs 794.8 µs +3.07%
parser[pdf.mjs] 124.3 ms 111.6 ms +11.39%
lexer[RadixUIAdoptionSection.jsx] 196.5 µs 148.7 µs +32.18%
parser[checker.ts] 419.2 ms 369.2 ms +13.55%
parser[antd.js] 773.1 ms 698.3 ms +10.71%

@overlookmotel overlookmotel force-pushed the 02-08-perf_parser_lex_identifiers_as_bytes_not_chars branch from 6360f04 to 5b9f1a7 Compare February 8, 2024 21:23
@overlookmotel overlookmotel marked this pull request as draft February 9, 2024 01:53
@overlookmotel overlookmotel marked this pull request as ready for review February 9, 2024 02:11
@overlookmotel overlookmotel force-pushed the 02-08-perf_parser_lex_identifiers_as_bytes_not_chars branch 2 times, most recently from 5b9f1a7 to 070d3c4 Compare February 9, 2024 02:13
Copy link
Copy Markdown
Member

Boshen commented Feb 9, 2024

Merge activity

  • Feb 8, 10:50 PM EST: @Boshen started a stack merge that includes this pull request via Graphite.
  • Feb 8, 10:56 PM EST: Graphite rebased this pull request as part of a merge.
  • Feb 8, 11:01 PM EST: @Boshen merged this pull request with Graphite.

@Boshen Boshen force-pushed the 02-08-refactor_parser_macro_for_ASCII_identifier_byte_handlers branch from 593a602 to beafc6c Compare February 9, 2024 03:51
Base automatically changed from 02-08-refactor_parser_macro_for_ASCII_identifier_byte_handlers to main February 9, 2024 03:55
Boshen pushed a commit that referenced this pull request Feb 9, 2024
Add a macro for ASCII identifier byte handlers.

This is a preparatory step towards #2352.
@Boshen Boshen force-pushed the 02-08-perf_parser_lex_identifiers_as_bytes_not_chars branch from 070d3c4 to 9a8d7b7 Compare February 9, 2024 03:55
@Boshen Boshen merged commit d3a59f2 into main Feb 9, 2024
@Boshen Boshen deleted the 02-08-perf_parser_lex_identifiers_as_bytes_not_chars branch February 9, 2024 04:01
Boshen pushed a commit that referenced this pull request Feb 9, 2024
Uses the `byte_search!` macro introduced in #2352 to consume whitespace after a line break.
@overlookmotel overlookmotel mentioned this pull request Feb 9, 2024
IWANABETHATGUY pushed a commit to IWANABETHATGUY/oxc that referenced this pull request May 29, 2024
…ct#2351)

Add a macro for ASCII identifier byte handlers.

This is a preparatory step towards oxc-project#2352.
IWANABETHATGUY pushed a commit to IWANABETHATGUY/oxc that referenced this pull request May 29, 2024
This PR re-implements lexing identifiers with a fast path for the most common case - identifiers which are pure ASCII characters, using the new `Source` / `SourcePosition` APIs.

Lexing identifiers is a hot path, and accounts for the majority of the time the Lexer spends. The performance bump from this change is (if I do say so myself!) quite decent.

I've spent a lot of time tuning the implementation, which gained a further 10-15% on the Lexer benchmarks compared to my first, simpler attempt. Some of the design decisions, if they look odd, are likely motivated by gains in performance.

### Techniques

This implementation uses a few different strategies for performance:

* Search byte-by-byte, not char-by-char.
* Process batches of 32 bytes at a time to reduce bounds checks.
* Mark uncommon paths `#[cold]`.

### Structure

The implementation is built in 3 layers:

1. ASCII characters only.
2. ASCII and Unicode characters.
3. `\` escape sequences (and all the above).

`identifier_name_handler` starts at the top layer, and is optimized for consuming ASCII as fast as possible. Each "layer" is considered more uncommon than the previous, and dropping down a layer is a de-opt.

I'm assuming that 95%+ of JavaScript code does not include either Unicode characters or escapes in identifiers, so the speed of the fast path is prioritised.

That said, once a Unicode character is encountered, the next layer does expect to find further Unicode characters, rather than de-opting over and over again. If an identifier *starts* with a Unicode character, it enters the code straight on the 2nd layer, so is not penalised by going through a `#[cold]` boundary. Lexing Unicode is never going to be as fast as ASCII, but still I felt it was important not to penalise it unnecessarily, so as not to be Anglo-centric.

### ASCII search macro

The main ASCII search is implemented as a macro. I found that, for reasons I don't understand, it's significantly faster to have all the code in a single function, even compared to multiple functions marked `#[inline]` or `#[inline(always)]`. The fastest implementation also requires some code to be repeated twice, which is nicer to do with a macro.

This macro, and the `ByteMatchTable` types that go with it, are designed to be re-usable. Next step will be to apply them for whitespace and strings, which should be fairly simple.

Searching in batches of 32 bytes is also designed to be forward-compatible with SIMD.

### Bye bye `AutoCow`

`AutoCow` is removed. Instead, a string-builder is only created if it's needed, when a `\` escape is first encountered. The string builder is also more efficient than `AutoCow` was, as it copies bytes in chunks, rather than 1-by-1.

This won't make much difference for identifiers, as escapes are so rare anyway, but this same technique can be used for strings, where they're more common.
IWANABETHATGUY pushed a commit to IWANABETHATGUY/oxc that referenced this pull request May 29, 2024
Uses the `byte_search!` macro introduced in oxc-project#2352 to consume whitespace after a line break.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-parser Area - Parser

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants