
Faster identifier tokenizing #13262

Merged
JLHwung merged 2 commits into babel:main from JLHwung:faster-identifier-tokenizing on May 6, 2021

Conversation

@JLHwung
Contributor

@JLHwung JLHwung commented May 5, 2021

| Q | A |
| --- | --- |
| License | MIT |

This PR includes commits from #13256; see here for the real diff. Edit: already rebased.

This PR

  1. moves the Flow @@iterator parsing into the Flow plugin and simplifies the tokenizer state. It turns out we don't actually need this.state.isIterator as long as the Flow plugin has a dedicated code path for iterator identifiers, so state.isIterator is removed.

  2. passes the identifier start through to the word reader, so we don't have to re-read the input source and test it with isIdentifierChar.

Currently, when parsing the length-3 input ab;, this.input.codePointAt is called 5 times:

| Seq | Position | Character | Context |
| --- | --- | --- | --- |
| 1 | 0 | "a" | getTokenFromCode |
| 2 | 0 | "a" | readWord1 |
| 3 | 1 | "b" | readWord1 |
| 4 | 2 | ";" | readWord1 |
| 5 | 2 | ";" | getTokenFromCode |

This PR passes the code point 0x61 read in sequence 1 through to readWord1, so we can avoid reading it again in sequence 2. Now, when parsing ab;, only 4 codePointAt calls are issued (sequences 1, 3, 4, 5). As we can see, the gain diminishes for longer identifier names.
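The pass-through trick can be sketched as follows (a toy model with hypothetical names, not Babel's real readWord1): the caller hands the already-read first code point to the word reader, which skips past it instead of calling codePointAt on position 0 a second time.

```javascript
// Toy identifier-char predicate: ASCII letters only, enough for the sketch.
function isIdentifierChar(code) {
  return (code >= 0x61 && code <= 0x7a) || (code >= 0x41 && code <= 0x5a);
}

// Instrumented read so we can count codePointAt calls.
let calls = 0;
function codePointAt(input, pos) {
  calls++;
  return input.codePointAt(pos);
}

// `firstCode` is the code point the caller (getTokenFromCode) already read
// at `pos`; readWord1 skips it rather than re-reading it.
function readWord1(input, pos, firstCode) {
  const chunkStart = pos;
  pos += firstCode <= 0xffff ? 1 : 2; // advance past the char we were handed
  while (pos < input.length) {
    const ch = codePointAt(input, pos); // only positions after the first
    if (!isIdentifierChar(ch)) break;
    pos += ch <= 0xffff ? 1 : 2;
  }
  return input.slice(chunkStart, pos);
}

const input = "ab;";
const first = codePointAt(input, 0);     // sequence 1 (getTokenFromCode)
const word = readWord1(input, 0, first); // sequences 3 and 4, no re-read
console.log(word, calls); // "ab" 3
```

In the real parser the terminating ";" is read once more by getTokenFromCode, which is why the PR counts 4 calls rather than the 3 this reduced sketch shows.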

Although we could do the same for escaped identifiers, I don't think it is worth the effort because escaped identifiers are rare.

Benchmark results

Combining these two tricks, we see up to a 4.5% performance gain on length-1 identifiers (best case).

$ node --predictable ./benchmark/many-identifiers/1-length.bench.mjs
baseline 64 length-1 identifiers: 15836 ops/sec ±67.43% (0.063ms)
baseline 128 length-1 identifiers: 16041 ops/sec ±2.09% (0.062ms)
baseline 256 length-1 identifiers: 8463 ops/sec ±1.48% (0.118ms)
baseline 512 length-1 identifiers: 4261 ops/sec ±1.21% (0.235ms)
baseline 1024 length-1 identifiers: 2165 ops/sec ±1.03% (0.462ms)
current 64 length-1 identifiers: 20153 ops/sec ±81.04% (0.05ms)
current 128 length-1 identifiers: 17915 ops/sec ±0.48% (0.056ms)
current 256 length-1 identifiers: 8844 ops/sec ±0.96% (0.113ms)
current 512 length-1 identifiers: 4410 ops/sec ±1.42% (0.227ms)
current 1024 length-1 identifiers: 2191 ops/sec ±1.22% (0.456ms)

up to a 2.5% performance gain on length-2 identifiers

$ node --predictable ./benchmark/many-identifiers/2-length.bench.mjs
baseline 64 length-2 identifiers: 20141 ops/sec ±66.04% (0.05ms)
baseline 128 length-2 identifiers: 15641 ops/sec ±1.62% (0.064ms)
baseline 256 length-2 identifiers: 7981 ops/sec ±1.06% (0.125ms)
baseline 512 length-2 identifiers: 3959 ops/sec ±1.11% (0.253ms)
baseline 1024 length-2 identifiers: 1935 ops/sec ±1.47% (0.517ms)
current 64 length-2 identifiers: 21245 ops/sec ±63.76% (0.047ms)
current 128 length-2 identifiers: 15943 ops/sec ±1.32% (0.063ms)
current 256 length-2 identifiers: 8079 ops/sec ±1.23% (0.124ms)
current 512 length-2 identifiers: 4064 ops/sec ±0.99% (0.246ms)
current 1024 length-2 identifiers: 2051 ops/sec ±0.2% (0.488ms)

and no significant performance gain on length-20 identifiers. Note that in predictable mode, the first bench suite always shows significant variance and should not be taken into account.

baseline 64 length-20 identifiers: 14220 ops/sec ±69.35% (0.07ms)
baseline 128 length-20 identifiers: 11560 ops/sec ±0.96% (0.087ms)
baseline 256 length-20 identifiers: 5885 ops/sec ±0.54% (0.17ms)
baseline 512 length-20 identifiers: 2902 ops/sec ±1.38% (0.345ms)
baseline 1024 length-20 identifiers: 1429 ops/sec ±1.43% (0.7ms)
current 64 length-20 identifiers: 16010 ops/sec ±55.29% (0.062ms)
current 128 length-20 identifiers: 11621 ops/sec ±1.02% (0.086ms)
current 256 length-20 identifiers: 5901 ops/sec ±0.5% (0.169ms)
current 512 length-20 identifiers: 2906 ops/sec ±0.91% (0.344ms)
current 1024 length-20 identifiers: 1417 ops/sec ±2.47% (0.706ms)

@JLHwung JLHwung added pkg: parser PR: Performance 🏃‍♀️ A type of pull request used for our changelog categories labels May 5, 2021
@codesandbox-ci

codesandbox-ci bot commented May 5, 2021

This pull request is automatically built and testable in CodeSandbox.

To see build info of the built libraries, click here or the icon next to each commit SHA.

Latest deployment of this branch, based on commit 00c42cc:

Sandbox Source
babel-repl-custom-plugin Configuration
babel-plugin-multi-config Configuration

@babel-bot
Collaborator

babel-bot commented May 5, 2021

Build successful! You can test your changes in the REPL here: https://babeljs.io/repl/build/45886/

// Allow @@iterator and @@asyncIterator as an identifier only inside type
if (!this.isIterator(word) || !this.state.inType) {
this.raise(this.state.pos, Errors.InvalidIdentifier, fullWord);
}
Contributor Author

@JLHwung JLHwung May 5, 2021


The implementation of readIterator was copied from the original readWord. It overlooks that an iterator identifier may contain escapes, which should not be allowed:

For example, Babel currently parses this successfully:

function foo(): { @@iter\u0061tor: () => string } {
  return (0: any);
}

This PR focuses on performance only and I will open a new PR for that after this PR gets merged.
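The follow-up fix the comment describes could look roughly like this (a hedged sketch with hypothetical helper names, not the eventual Babel patch): track whether the word contained a Unicode escape, and reject the iterator identifier when it did, even though the escaped form decodes to "iterator".

```javascript
// Decode \uXXXX escapes in a raw word, recording whether any were present
// (analogous to a containsEsc-style flag in a real tokenizer).
function readEscapedWord(raw) {
  let containsEsc = false;
  const word = raw.replace(/\\u([0-9a-fA-F]{4})/g, (_, hex) => {
    containsEsc = true;
    return String.fromCharCode(parseInt(hex, 16));
  });
  return { word, containsEsc };
}

// Hypothetical validator for the word after "@@": the name must match,
// and the escaped spelling must be rejected outright.
function checkIterator(raw) {
  const { word, containsEsc } = readEscapedWord(raw);
  if (word !== "iterator" && word !== "asyncIterator") {
    throw new SyntaxError("Invalid identifier @@" + word);
  }
  if (containsEsc) {
    throw new SyntaxError("Escape sequence in keyword @@" + word);
  }
  return word;
}

console.log(checkIterator("iterator")); // "iterator"
try {
  checkIterator("iter\\u0061tor"); // raw source text iter\u0061tor
} catch (e) {
  console.log(e.message); // rejected despite decoding to "iterator"
}
```

This mirrors how JavaScript itself treats escapes in keywords: `\u0069f` decodes to `if` but is still a syntax error as a keyword.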

Member

@nicolo-ribaudo nicolo-ribaudo left a comment


This is probably less than 1% when parsing real files, but I like that it moves Flow stuff to the Flow plugin.

JLHwung added 2 commits May 6, 2021 12:31
- Move iterator identifier parsing to the Flow plugin
- If the character is an identifier start, pass it to readWord1
@JLHwung JLHwung force-pushed the faster-identifier-tokenizing branch from a20aef9 to 00c42cc Compare May 6, 2021 16:31
@JLHwung JLHwung merged commit a8fea40 into babel:main May 6, 2021
@JLHwung JLHwung deleted the faster-identifier-tokenizing branch May 6, 2021 22:47
@github-actions github-actions bot added the outdated A closed issue/PR that is archived due to age. Recommended to make a new issue label Aug 6, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 6, 2021