perf(regular_expression): skip capturing-group pre-parse when pattern has no `(` #23908
Merged
Merging this PR will not alter performance
… has no `(`

`PatternParser::parse` always ran a full pre-pass over the pattern to count capturing groups and collect named-group names. As the existing comment notes, that pass "is completely useless if the pattern does not contain any capturing groups".

A pattern with no `(` has no capturing group, no named group and no duplicate name, so the pre-parse results match `State::new`'s defaults with one exception: `named_capture_groups`, which for a group-free pattern depends solely on the `u`/`v` flag.

Detect the absence of `(` with a cheap scan over the already-decoded units (`Reader::contains`) and skip the pre-parse and its allocations in that case. A pattern whose `(` turns out to be escaped or non-capturing still takes the existing slow path, so behavior is preserved.

~11% faster for paren-free regexes in a local microbench; output is byte-identical (verified by a differential AST/error check over 60 patterns covering backreferences, `\k<name>`, escaped parens, character classes, named/duplicate groups and `u`/`v` flags).
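The shortcut described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual `oxc_regular_expression` code: `PreParseResult`, `decode`, `pre_parse` and `capture_info` are hypothetical names, the pattern is modeled as a plain slice of `u32` code points, and the slow path is a heavily simplified scanner.

```rust
/// Hypothetical summary of what a capturing-group pre-pass collects.
#[derive(Debug, Default, PartialEq)]
struct PreParseResult {
    capturing_groups: u32,
    group_names: Vec<String>,
}

/// Decode a pattern into code-point units (stand-in for the already-decoded units).
fn decode(pattern: &str) -> Vec<u32> {
    pattern.chars().map(u32::from).collect()
}

/// Simplified slow path: one full pass that skips escapes and character classes,
/// counts capturing groups and collects group names.
fn pre_parse(units: &[u32]) -> PreParseResult {
    let at = |i: usize| units.get(i).copied().and_then(char::from_u32);
    let mut result = PreParseResult::default();
    let mut in_class = false;
    let mut i = 0;
    while i < units.len() {
        match at(i) {
            Some('\\') => i += 1, // also skip the escaped unit
            Some('[') => in_class = true,
            Some(']') => in_class = false,
            Some('(') if !in_class => {
                if at(i + 1) != Some('?') {
                    result.capturing_groups += 1; // plain `(...)`
                } else if at(i + 2) == Some('<') && !matches!(at(i + 3), Some('=' | '!')) {
                    // Named group `(?<name>...)`; `(?<=` and `(?<!` are lookbehinds.
                    result.capturing_groups += 1;
                    let name: String = units[i + 3..]
                        .iter()
                        .map_while(|&u| char::from_u32(u).filter(|&c| c != '>'))
                        .collect();
                    result.group_names.push(name);
                }
                // `(?:...)`, `(?=...)` and `(?!...)` capture nothing.
            }
            _ => {}
        }
        i += 1;
    }
    result
}

/// Fast path: with no `(` anywhere, the defaults are already the answer.
fn capture_info(units: &[u32]) -> PreParseResult {
    if !units.contains(&u32::from('(')) {
        return PreParseResult::default();
    }
    // Conservative: an escaped or non-capturing `(` still takes the full pass.
    pre_parse(units)
}
```

For example, `capture_info(&decode("a+b*"))` returns the default result without entering `pre_parse`, while `capture_info(&decode(r"\("))` still runs the full pass and reports zero groups. The check can only produce false positives (patterns that contain `(` but no capturing group), which is what keeps the shortcut safe.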
camc314 added a commit that referenced this pull request Jun 29, 2026

### 💥 BREAKING CHANGES

- 94fbacb ast: [**BREAKING**] Only export `AstBuilder` and `NONE` in `builder` module (#23876) (overlookmotel)
- 8de5122 ecmascript: [**BREAKING**] Switch to new `AstBuilder` (#23834) (overlookmotel)
- dc0ef38 transformer: [**BREAKING**] Switch to new `AstBuilder` (#23831) (overlookmotel)
- 88f4455 str: [**BREAKING**] `Str` and `Ident` methods take `&GetAllocator` (#23781) (overlookmotel)
- 36009dd allocator: [**BREAKING**] `GetAllocator::allocator` take `&self` (#23676) (overlookmotel)
- bd74f9d allocator: [**BREAKING**] Rename `AllocatorAccessor` trait to `GetAllocator` (#23675) (overlookmotel)

### 🚀 Features

- 326fe25 transformer_plugins: Support `typeof` `define` keys (#23605) (Alexander Lichter)
- f2091b3 ast: Unify old and new `AstBuilder`s (#23875) (overlookmotel)
- cd1fd12 codegen: Expose `Codegen::print_string` API (#23785) (camc314)
- 785461b ast: Add custom builder methods to AST types (#23651) (overlookmotel)
- 05d1357 ast: Add AST creation methods to AST types (#23650) (overlookmotel)
- 2580eda str: Add `Str::from_str_in` and `Ident::from_str_in` methods (#23767) (overlookmotel)
- 6883fcf minifier: Fold write-once falsy var to false in boolean context (#23540) (Dunqing)
- fcbf993 allocator: Add `Vec::from_value_in` method (#23718) (overlookmotel)
- 989ddb7 allocator: Add `Vec::from_box_in` method (#23717) (overlookmotel)
- 9d1aa7f allocator: Improve `PartialEq` for `Vec` (#23716) (overlookmotel)

### 🐛 Bug Fixes

- da0e5bf minifier: Don't reorder a closed-over TDZ read when inlining a var (#23771) (Dunqing)
- 0b3021f allocator: Remove `Vec::from_box_in` (#23873) (overlookmotel)
- 0ab64ec ast: Silence deprecation warnings within files defining deprecated `AstBuilder` methods (#23889) (overlookmotel)
- 8c07cad all: Enable `disable_old_builder` Cargo feature for `oxc_ast` crate in tests (#23888) (overlookmotel)
- 3800f01 ast: Legacy `AstBuilder` methods take `self` not `&self` (#23891) (overlookmotel)
- 869ac20 semantic/cfg: Connect for update exit to loop test (#23791) (camc314)
- d3e92d5 semantic/cfg: Connect while branches from condition exit (#23790) (camc314)
- 025045d ast: `ExportNamedDeclaration` plain builder methods return boxed nodes (#23783) (overlookmotel)
- 7537c58 ast: Fix name of `AstBuilder` method for `Expression::V8IntrinsicExpression` (#23766) (overlookmotel)
- 3f574f5 traverse: Fix unsoundness in `Traverse` walk functions (#23745) (overlookmotel)
- 585760f parser: String in AST reference arena (#23721) (overlookmotel)
- 7231d55 allocator: Fix unsound lifetime extension in `Box::new_in` (#23685) (overlookmotel)

### ⚡ Performance

- d5c916a semantic: Flatten hoisting_variables to avoid per-scope map allocation (#23927) (Lawrence Lin)
- e71609d minifier: Bail member-expr folding before the side-effect walk (#23924) (Lawrence Lin)
- e1f89ab minifier: Reduce string allocations folding addition (#23846) (overlookmotel)
- 9f6ee3b isolated-declarations: Pool scope maps to avoid per-scope alloc/rehash (#23761) (Boshen)
- 0b07c4c semantic: Avoid heap alloc for catch-clause binding ids (#23911) (Lawrence Lin)
- c5eef8b regular_expression: Skip capturing-group pre-parse when pattern has no `(` (#23908) (Lawrence Lin)
- b4f5b4b isolated_declarations: Remove redundant clone of formal parameter pattern (#23912) (Lawrence Lin)
- 53d083f isolated_declarations: Use `TakeIn` not `CloneIn` (#23847) (overlookmotel)
- 3ea9304 react_compiler: Use faster API to arena allocate strings (#23849) (overlookmotel)
- a6d8e45 parser: Avoid span lookup for arrow expression body (#23788) (camc314)
- e1886a0 transformer, minifier: Use `static_ident!` macro to create static `Ident`s (#23727) (overlookmotel)
- 5527bef transformer/object-rest-spread: Reduce iteration (#23720) (overlookmotel)
- 680ffbc transformer: Allocate AST nodes in arena directly (#23711) (overlookmotel)
- 1c63c66 parser: Allocate AST nodes in arena directly (#23712) (overlookmotel)
- 3855f0c minifier: Allocate AST nodes in arena directly (#23710) (overlookmotel)
- d025887 isolated_declarations: Allocate AST nodes in arena directly (#23709) (overlookmotel)
- 10b96c6 parser: Remove string search from parsing JSX element name (#23713) (overlookmotel)

### 📚 Documentation

- 3d61dea all: Correct capitalization in comments (#23887) (overlookmotel)
- aa1ad74 ast: Add `#[deprecated]` to legacy `AstBuilder` methods (#23877) (overlookmotel)
- a4676db ast: Correct doc comment for `NONE` (#23765) (overlookmotel)
- 419ec80 syntax: Fix typo in doc comment (#23674) (overlookmotel)

### 🛡️ Security

- 3cdd18f deps: Update npm packages (#23690) (renovate[bot])

Co-authored-by: Boshen <1430279+Boshen@users.noreply.github.com>
Co-authored-by: Cameron <cameron.clark@hey.com>
camc314 pushed a commit that referenced this pull request Jul 3, 2026

… has no `(` (#23908)

## What

`PatternParser::parse` always runs a full pre-pass over the whole pattern to count capturing groups and collect named-group names before the real parse. The existing comment already flags this:

```rust
// NOTE: It means that this perform 2 loops for every cases.
// - Cons: 1st pass is completely useless if the pattern does not contain any capturing groups
// We may re-consider this if we need more performance rather than simplicity.
```

A pattern with no `(` can have no capturing group, no named group and no duplicate name, so the pre-parse results match `State::new`'s defaults with one exception: `named_capture_groups`, which for a group-free pattern depends solely on the `u`/`v` flag. This detects the absence of `(` with a cheap scan over the already-decoded units (`Reader::contains`) and skips the pre-parse and its allocations (the `AlternativeTracker` vec, the per-named-group vecs, the group-name `FxHashSet`) in that case.

A pattern whose `(` turns out to be escaped (`\(`) or non-capturing (`(?:…)`, lookarounds) still takes the existing slow path. `contains('(')` is a conservative over-approximation, so the pre-parse is only ever *skipped* when the pattern is provably group-free.

## Correctness

Output is byte-identical. I verified this with a differential harness that parses 60 patterns with both the old and new code and compares the resulting AST / error exactly. The corpus deliberately covers the tricky equivalences:

- backreferences in group-free patterns: `\1`, `\1\2\3`, `a\1b` (legacy octal in Annex B), both with and without the `u` flag;
- `\k<a>` in a group-free pattern with `""`, `"u"`, `"v"` flags (the only case where `named_capture_groups` is non-default);
- a `(` that must still take the slow path: `\(`, `[(]`, `(?:…)`, lookarounds, named groups, duplicate-name errors (`(?<a>)(?<a>)`), unterminated `(`.

All 60 cases produce identical output, and `cargo test -p oxc_regular_expression` passes.
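A differential check of the kind described in the Correctness section might look like the sketch below. It is not the author's harness: `differential`, `parse_old`, `parse_new`, `Summary` and `corpus` are hypothetical, the two "parsers" are toy stand-ins that only summarize group information instead of building an AST, duplicate-name errors are not modeled, and the rule that `named_capture_groups` is set by the `u`/`v` flag or by the presence of a named group is an assumption made for this example.

```rust
use std::fmt::Debug;

/// Toy stand-in for a parse result (the real harness compares full ASTs and errors).
#[derive(Debug, PartialEq)]
struct Summary {
    capturing_groups: usize,
    named_capture_groups: bool,
}
type Outcome = Result<Summary, String>;

/// "Old" behavior: always scan the whole pattern.
fn parse_old(pattern: &str, flags: &str) -> Outcome {
    let unicode = flags.contains('u') || flags.contains('v');
    let (mut groups, mut named, mut depth, mut in_class) = (0usize, false, 0usize, false);
    let mut chars = pattern.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\\' => {
                chars.next(); // skip the escaped character
            }
            '[' => in_class = true,
            ']' => in_class = false,
            '(' if !in_class => {
                depth += 1;
                if chars.peek() != Some(&'?') {
                    groups += 1;
                } else {
                    let mut ahead = chars.clone();
                    ahead.next(); // the `?`
                    // `(?<name>` is a named group; `(?<=` and `(?<!` are lookbehinds.
                    if ahead.next() == Some('<') && !matches!(ahead.next(), Some('=' | '!')) {
                        groups += 1;
                        named = true;
                    }
                }
            }
            ')' if !in_class => depth = depth.saturating_sub(1),
            _ => {}
        }
    }
    if depth > 0 {
        return Err("unterminated group".to_string());
    }
    Ok(Summary { capturing_groups: groups, named_capture_groups: unicode || named })
}

/// "New" behavior: skip the scan when the pattern contains no `(` at all.
fn parse_new(pattern: &str, flags: &str) -> Outcome {
    if !pattern.contains('(') {
        let unicode = flags.contains('u') || flags.contains('v');
        return Ok(Summary { capturing_groups: 0, named_capture_groups: unicode });
    }
    parse_old(pattern, flags)
}

/// Run both implementations over every (pattern, flags) pair and report mismatches.
fn differential<T: PartialEq + Debug>(
    corpus: &[(&str, &str)],
    old: impl Fn(&str, &str) -> T,
    new: impl Fn(&str, &str) -> T,
) -> Vec<String> {
    corpus
        .iter()
        .filter_map(|&(pattern, flags)| {
            let (before, after) = (old(pattern, flags), new(pattern, flags));
            (before != after).then(|| format!("/{pattern}/{flags}: {before:?} != {after:?}"))
        })
        .collect()
}

/// A small corpus in the spirit of the cases listed above, crossed with the flags.
fn corpus() -> Vec<(&'static str, &'static str)> {
    let patterns = [
        r"\1", r"\1\2\3", r"a\1b", r"\k<a>", r"\(", "[(]", "(?:a)", "(?=a)", "(?<a>x)", "(", "abc",
    ];
    let mut cases = Vec::new();
    for flags in ["", "u", "v"] {
        for pattern in patterns {
            cases.push((pattern, flags));
        }
    }
    cases
}
```

An empty result from `differential(&corpus(), parse_old, parse_new)` means the two implementations agree on every case. Returning the mismatches as strings rather than panicking on the first one makes it easier to see whether a regression is isolated or systematic.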
## Measurement

Local microbench over 15 realistic paren-free regexes (release):

| | ns/parse |
| --- | --- |
| before | ~258 ns |
| after | ~230 ns |

~11% faster for paren-free patterns, which are the large majority of real-world regexes. The change removes a whole code-point pass plus its allocations for those patterns.

The allocation snapshot (`cargo allocs`) confirms the reduction on real files: the allocation *count* drops while total bytes are unchanged.

| file | allocs before → after |
| --- | --- |
| antd.js | 3720 → 3661 |
| kitchen-sink.tsx | 2108 → 2058 |
| pdf.mjs | 356 → 337 |
| checker.ts | 1825 → 1818 |
| App.tsx | 364 → 360 |

Honest note on CodSpeed: the bench corpus is regex-light (regex parsing is well under 2% of parse time on the heaviest bench file and ~0% on the rest), so the CodSpeed comment will likely show no change. The win shows up on regex-heavy inputs (e.g. moment.js / vue.js, where regex parsing is ~7–10% of parse). This PR is offered in response to the maintainer note quoted above, framed as removing documented useless work rather than as a benchmark mover.

---

_Disclosure: developed with AI assistance (Claude), reviewed and verified by the author._

---------

Co-authored-by: Yuji Sugiura <y.sugiura.0316@gmail.com>
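As a quick sanity check on the figures quoted in the Measurement section, the relative changes can be recomputed from the numbers themselves. The helper names `percent_drop`, `ALLOCS` and `report` are introduced here purely for illustration; the inputs are copied from the two tables.

```rust
/// Relative reduction from `before` to `after`, in percent.
fn percent_drop(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// (file, allocations before, allocations after), copied from the allocation table.
const ALLOCS: [(&str, u32, u32); 5] = [
    ("antd.js", 3720, 3661),
    ("kitchen-sink.tsx", 2108, 2058),
    ("pdf.mjs", 356, 337),
    ("checker.ts", 1825, 1818),
    ("App.tsx", 364, 360),
];

/// Print the time reduction and the per-file allocation savings.
fn report() {
    // 258 ns -> 230 ns per parse in the local microbench.
    println!("time: {:.1}% less per parse", percent_drop(258.0, 230.0));
    for (file, before, after) in ALLOCS {
        println!(
            "{file}: {} fewer allocations ({:.1}%)",
            before - after,
            percent_drop(f64::from(before), f64::from(after))
        );
    }
}
```

With the quoted timings, `(258 - 230) / 258` comes out just under 11%, consistent with the "~11%" headline. The allocation savings are small relative to each file's total, which fits the note that the bench corpus is regex-light.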
camc314 added a commit that referenced this pull request Jul 3, 2026 (same release changelog as the Jun 29, 2026 commit above)