perf(regular_expression): skip capturing-group pre-parse when pattern has no `(` (#23908)

Merged: leaysgur merged 3 commits into oxc-project:main from linyiru:perf/regex-skip-prescan on Jun 29, 2026
@linyiru linyiru commented Jun 28, 2026

Contributor

What

`PatternParser::parse` always runs a full pre-pass over the whole pattern to count
capturing groups and collect named-group names before the real parse. The existing
comment already flags this:

// NOTE: It means that this perform 2 loops for every cases.
// - Cons: 1st pass is completely useless if the pattern does not contain any capturing groups
// We may re-consider this if we need more performance rather than simplicity.

A pattern with no `(` can have no capturing group, no named group and no duplicate
name, so the pre-parse results are exactly `State::new`'s defaults. Only
`named_capture_groups` differs, and for a group-free pattern it depends solely on the
`u`/`v` flag. This PR detects the absence of `(` with a cheap scan over the
already-decoded units (`Reader::contains`) and, in that case, skips the pre-parse and
its allocations (the `AlternativeTracker` vec, the per-named-group vecs, the group-name
`FxHashSet`).

A `(` that turns out to be escaped (`\(`) or non-capturing (`(?:…)`, lookarounds)
still takes the existing slow path: `contains('(')` is a conservative
over-approximation, so the pre-parse is only ever skipped when the pattern is
provably group-free.
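
To make the shape of the fast path concrete, here is a minimal, self-contained sketch. It is not the code from this PR: `PrescanResult`, `may_contain_group`, `slow_prescan`, `prescan` and `decode` are hypothetical names, and the toy `slow_prescan` ignores character classes and group names entirely. It only illustrates that the `(` check is conservative: an escaped `\(` still goes through the slow pass, which then finds no groups.

```rust
/// Hypothetical stand-in for the pre-parse results (not the real `State`).
#[derive(Debug, Default, PartialEq)]
struct PrescanResult {
    capturing_groups: u32,
    named_groups: u32,
}

/// Decode a pattern into code-point units, for illustration only.
fn decode(pattern: &str) -> Vec<u32> {
    pattern.chars().map(u32::from).collect()
}

/// Conservative check: true if a `(` appears anywhere, escaped or not.
fn may_contain_group(units: &[u32]) -> bool {
    units.contains(&u32::from('('))
}

/// Toy version of the full pre-pass. Assumption: it ignores character
/// classes such as `[(]`, which a real parser must handle.
fn slow_prescan(units: &[u32]) -> PrescanResult {
    let mut result = PrescanResult::default();
    let mut i = 0;
    while i < units.len() {
        if units[i] == u32::from('\\') {
            i += 2; // skip the backslash and the unit it escapes
            continue;
        }
        if units[i] == u32::from('(') {
            let next = units.get(i + 1).copied();
            let after = units.get(i + 2).copied();
            let third = units.get(i + 3).copied();
            if next != Some(u32::from('?')) {
                result.capturing_groups += 1; // plain `(...)`
            } else if after == Some(u32::from('<'))
                && third != Some(u32::from('='))
                && third != Some(u32::from('!'))
            {
                // `(?<name>...)`; `(?<=` and `(?<!` are lookbehinds
                result.capturing_groups += 1;
                result.named_groups += 1;
            }
        }
        i += 1;
    }
    result
}

/// Fast path: skip the pre-pass when no `(` is present at all.
fn prescan(units: &[u32]) -> PrescanResult {
    if !may_contain_group(units) {
        return PrescanResult::default();
    }
    slow_prescan(units)
}
```

In this sketch `prescan(&decode(r"\d+-\d+"))` returns the defaults without running the scan loop, while `prescan(&decode(r"a\(b"))` runs the slow pass and arrives at the same defaults.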

Correctness

Output is byte-identical. I verified this with a differential harness that parses 60
patterns with both the old and new code and compares the resulting AST / error
exactly. The corpus deliberately covers the tricky equivalences:

  • backreferences in group-free patterns: `\1`, `\1\2\3`, `a\1b` (legacy octal in
    Annex B), both with and without the `u` flag;
  • `\k<a>` in a group-free pattern with `""`, `"u"`, `"v"` flags (the only case where
    `named_capture_groups` is non-default);
  • a `(` that must still take the slow path: `\(`, `[(]`, `(?:…)`, lookarounds,
    named groups, duplicate-name errors (`(?<a>)(?<a>)`), unterminated `(`.

All 60 cases produce identical output, and `cargo test -p oxc_regular_expression`
passes.
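
The harness itself is not part of this PR, so the following is only a sketch of the general differential-testing idea, under the assumption that both implementations can be called as `Fn(&str) -> Result<T, E>`. The names `diff_corpus`, `count_groups_old` and `count_groups_new` are hypothetical, and the two toy "parsers" stand in for the old and new `PatternParser::parse`.

```rust
/// Run both implementations on every input and return the inputs whose
/// results (success value or error) differ.
fn diff_corpus<'a, T, E>(
    corpus: &[&'a str],
    old: impl Fn(&str) -> Result<T, E>,
    new: impl Fn(&str) -> Result<T, E>,
) -> Vec<&'a str>
where
    T: PartialEq,
    E: PartialEq,
{
    corpus
        .iter()
        .copied()
        .filter(|&pattern| old(pattern) != new(pattern))
        .collect()
}

/// Toy "old" implementation: always scans the whole input.
fn count_groups_old(pattern: &str) -> Result<usize, String> {
    let opens = pattern.chars().filter(|&c| c == '(').count();
    let closes = pattern.chars().filter(|&c| c == ')').count();
    if opens == closes {
        Ok(opens)
    } else {
        Err(format!("unbalanced parentheses in {pattern}"))
    }
}

/// Toy "new" implementation: takes a shortcut when no parenthesis is present,
/// otherwise defers to the old code.
fn count_groups_new(pattern: &str) -> Result<usize, String> {
    if !pattern.contains('(') && !pattern.contains(')') {
        return Ok(0);
    }
    count_groups_old(pattern)
}
```

An empty result from `diff_corpus` means the two implementations agree on the whole corpus. Comparing errors as well as successes matters: a shortcut that only checked for `(` would silently turn the error for `a)` into `Ok(0)`, and the harness would report that input.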

Measurement

Local microbench over 15 realistic paren-free regexes (release):

| | ns/parse |
| --- | --- |
| before | ~258 ns |
| after | ~230 ns |

~11% faster for paren-free patterns, which are the large majority of real-world
regexes. The change removes a whole code-point pass plus its allocations for those
patterns.
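
For readers who want to reproduce a figure of this kind, here is a rough sketch of a wall-clock microbench helper and the percentage arithmetic behind the "~11%" number. It is not the benchmark used for this PR; `ns_per_call` and `percent_faster` are hypothetical helpers, and a plain `Instant` loop is far noisier than a dedicated benchmarking tool.

```rust
use std::time::Instant;

/// Average nanoseconds per call of `f` over `iters` runs.
/// A rough estimate only: no warm-up, no outlier handling.
fn ns_per_call<F: FnMut()>(iters: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}

/// Time saved, as a percentage of the `before` figure.
fn percent_faster(before_ns: f64, after_ns: f64) -> f64 {
    (before_ns - after_ns) / before_ns * 100.0
}
```

With the figures from the table, `percent_faster(258.0, 230.0)` is (258 − 230) / 258 ≈ 10.9%, which rounds to the quoted ~11%.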

The allocation snapshot (`cargo allocs`) confirms the reduction on real files: the
allocation count drops while total bytes are unchanged.

| file | allocs before → after |
| --- | --- |
| antd.js | 3720 → 3661 |
| kitchen-sink.tsx | 2108 → 2058 |
| pdf.mjs | 356 → 337 |
| checker.ts | 1825 → 1818 |
| App.tsx | 364 → 360 |
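
As a quick check of the table, the per-file savings can be computed directly from the before/after counts. The snippet below only redoes that subtraction; `ALLOC_COUNTS` and `alloc_savings` are hypothetical names, and the numbers are copied from the table above.

```rust
/// (file, allocations before, allocations after), copied from the table.
const ALLOC_COUNTS: [(&str, u32, u32); 5] = [
    ("antd.js", 3720, 3661),
    ("kitchen-sink.tsx", 2108, 2058),
    ("pdf.mjs", 356, 337),
    ("checker.ts", 1825, 1818),
    ("App.tsx", 364, 360),
];

/// Number of allocations saved per file, in table order.
fn alloc_savings() -> Vec<(&'static str, u32)> {
    ALLOC_COUNTS
        .iter()
        .map(|&(file, before, after)| (file, before - after))
        .collect()
}
```

That gives 59, 50, 19, 7 and 4 fewer allocations respectively, 139 in total across the five files.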

Honest note on CodSpeed: the bench corpus is regex-light (regex parsing is well under
2% of parse time on the heaviest bench file and ~0% on the rest), so the CodSpeed
comment will likely show no change. The win shows up on regex-heavy inputs (e.g.
moment.js / vue.js, where regex parsing is ~7–10% of parse). This PR is offered in
response to the maintainer note quoted above, framed as removing documented useless
work rather than as a benchmark mover.


Disclosure: developed with AI assistance (Claude), reviewed and verified by the author.

@linyiru linyiru requested a review from leaysgur as a code owner June 28, 2026 20:57
@camc314 camc314 added the A-regular-expression Area - Regular Expression label Jun 28, 2026
codspeed-hq Bot commented Jun 28, 2026


Merging this PR will not alter performance

✅ 62 untouched benchmarks
⏩ 9 skipped benchmarks [1]


Comparing linyiru:perf/regex-skip-prescan (aaf5745) with main (7cb85c4)

Footnotes

  1. 9 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@linyiru linyiru force-pushed the perf/regex-skip-prescan branch from 8504dc4 to 3a5fd7b Compare June 28, 2026 21:06
@linyiru linyiru requested a review from overlookmotel as a code owner June 28, 2026 21:06
… has no `(`

`PatternParser::parse` always ran a full pre-pass over the pattern to count
capturing groups and collect named-group names. As the existing comment notes,
that pass "is completely useless if the pattern does not contain any capturing
groups".

A pattern with no `(` has no capturing group, no named group and no duplicate
name, so the pre-parse results are exactly `State::new`'s defaults — only
`named_capture_groups` differs, and for a group-free pattern it depends solely
on the `u`/`v` flag. Detect the absence of `(` with a cheap scan over the
already-decoded units (`Reader::contains`) and skip the pre-parse and its
allocations in that case. A `(` that turns out to be escaped or non-capturing
still takes the existing slow path, so behavior is preserved.

~11% faster for paren-free regexes in a local microbench; output is
byte-identical (verified by a differential AST/error check over 60 patterns
covering backreferences, `\k<name>`, escaped parens, character classes,
named/duplicate groups and `u`/`v` flags).
@linyiru linyiru force-pushed the perf/regex-skip-prescan branch from 3a5fd7b to 3f5b66f Compare June 28, 2026 21:14

@leaysgur leaysgur left a comment

Member

Thanks!

@leaysgur leaysgur merged commit c5eef8b into oxc-project:main Jun 29, 2026
38 checks passed
camc314 added a commit that referenced this pull request Jun 29, 2026
### 💥 BREAKING CHANGES

- 94fbacb ast: [**BREAKING**] Only export `AstBuilder` and `NONE` in
`builder` module (#23876) (overlookmotel)
- 8de5122 ecmascript: [**BREAKING**] Switch to new `AstBuilder` (#23834)
(overlookmotel)
- dc0ef38 transformer: [**BREAKING**] Switch to new `AstBuilder`
(#23831) (overlookmotel)
- 88f4455 str: [**BREAKING**] `Str` and `Ident` methods take
`&GetAllocator` (#23781) (overlookmotel)
- 36009dd allocator: [**BREAKING**] `GetAllocator::allocator` take
`&self` (#23676) (overlookmotel)
- bd74f9d allocator: [**BREAKING**] Rename `AllocatorAccessor` trait to
`GetAllocator` (#23675) (overlookmotel)

### 🚀 Features

- 326fe25 transformer_plugins: Support `typeof` `define` keys (#23605)
(Alexander Lichter)
- f2091b3 ast: Unify old and new `AstBuilder`s (#23875) (overlookmotel)
- cd1fd12 codegen: Expose `Codegen::print_string` API (#23785) (camc314)
- 785461b ast: Add custom builder methods to AST types (#23651)
(overlookmotel)
- 05d1357 ast: Add AST creation methods to AST types (#23650)
(overlookmotel)
- 2580eda str: Add `Str::from_str_in` and `Ident::from_str_in` methods
(#23767) (overlookmotel)
- 6883fcf minifier: Fold write-once falsy var to false in boolean
context (#23540) (Dunqing)
- fcbf993 allocator: Add `Vec::from_value_in` method (#23718)
(overlookmotel)
- 989ddb7 allocator: Add `Vec::from_box_in` method (#23717)
(overlookmotel)
- 9d1aa7f allocator: Improve `PartialEq` for `Vec` (#23716)
(overlookmotel)

### 🐛 Bug Fixes

- da0e5bf minifier: Don't reorder a closed-over TDZ read when inlining a
var (#23771) (Dunqing)
- 0b3021f allocator: Remove `Vec::from_box_in` (#23873) (overlookmotel)
- 0ab64ec ast: Silence deprecation warnings within files defining
deprecated `AstBuilder` methods (#23889) (overlookmotel)
- 8c07cad all: Enable `disable_old_builder` Cargo feature for `oxc_ast`
crate in tests (#23888) (overlookmotel)
- 3800f01 ast: Legacy `AstBuilder` methods take `self` not `&self`
(#23891) (overlookmotel)
- 869ac20 semantic/cfg: Connect for update exit to loop test (#23791)
(camc314)
- d3e92d5 semantic/cfg: Connect while branches from condition exit
(#23790) (camc314)
- 025045d ast: `ExportNamedDeclaration` plain builder methods return
boxed nodes (#23783) (overlookmotel)
- 7537c58 ast: Fix name of `AstBuilder` method for
`Expression::V8IntrinsicExpression` (#23766) (overlookmotel)
- 3f574f5 traverse: Fix unsoundness in `Traverse` walk functions
(#23745) (overlookmotel)
- 585760f parser: String in AST reference arena (#23721) (overlookmotel)
- 7231d55 allocator: Fix unsound lifetime extension in `Box::new_in`
(#23685) (overlookmotel)

### ⚡ Performance

- d5c916a semantic: Flatten hoisting_variables to avoid per-scope map
allocation (#23927) (Lawrence Lin)
- e71609d minifier: Bail member-expr folding before the side-effect walk
(#23924) (Lawrence Lin)
- e1f89ab minifier: Reduce string allocations folding addition (#23846)
(overlookmotel)
- 9f6ee3b isolated-declarations: Pool scope maps to avoid per-scope
alloc/rehash (#23761) (Boshen)
- 0b07c4c semantic: Avoid heap alloc for catch-clause binding ids
(#23911) (Lawrence Lin)
- c5eef8b regular_expression: Skip capturing-group pre-parse when
pattern has no `(` (#23908) (Lawrence Lin)
- b4f5b4b isolated_declarations: Remove redundant clone of formal
parameter pattern (#23912) (Lawrence Lin)
- 53d083f isolated_declarations: Use `TakeIn` not `CloneIn` (#23847)
(overlookmotel)
- 3ea9304 react_compiler: Use faster API to arena allocate strings
(#23849) (overlookmotel)
- a6d8e45 parser: Avoid span lookup for arrow expression body (#23788)
(camc314)
- e1886a0 transformer, minifier: Use `static_ident!` macro to create
static `Ident`s (#23727) (overlookmotel)
- 5527bef transformer/object-rest-spread: Reduce iteration (#23720)
(overlookmotel)
- 680ffbc transformer: Allocate AST nodes in arena directly (#23711)
(overlookmotel)
- 1c63c66 parser: Allocate AST nodes in arena directly (#23712)
(overlookmotel)
- 3855f0c minifier: Allocate AST nodes in arena directly (#23710)
(overlookmotel)
- d025887 isolated_declarations: Allocate AST nodes in arena directly
(#23709) (overlookmotel)
- 10b96c6 parser: Remove string search from parsing JSX element name
(#23713) (overlookmotel)

### 📚 Documentation

- 3d61dea all: Correct capitalization in comments (#23887)
(overlookmotel)
- aa1ad74 ast: Add `#[deprecated]` to legacy `AstBuilder` methods
(#23877) (overlookmotel)
- a4676db ast: Correct doc comment for `NONE` (#23765) (overlookmotel)
- 419ec80 syntax: Fix typo in doc comment (#23674) (overlookmotel)

### 🛡️ Security

- 3cdd18f deps: Update npm packages (#23690) (renovate[bot])

Co-authored-by: Boshen <1430279+Boshen@users.noreply.github.com>
Co-authored-by: Cameron <cameron.clark@hey.com>
camc314 pushed a commit that referenced this pull request Jul 3, 2026
… has no `(` (#23908)

---------

Co-authored-by: Yuji Sugiura <y.sugiura.0316@gmail.com>
camc314 added a commit that referenced this pull request Jul 3, 2026