-
Notifications
You must be signed in to change notification settings - Fork 4
Description
Problem
Pattern (?m)^/.*\.php on 6MB input:
| Engine | Time |
|---|---|
| Go stdlib | 93 ms |
| coregex v0.11.1 | 66 ms |
| Rust regex | 0.79 ms |
Rust is 84x faster than coregex on this pattern.
Root Cause Analysis
Current coregex approach (UseMultilineReverseSuffix)
- Suffix prefilter finds
.phpcandidates - Backward scan to line start (
\n) - Forward PikeVM verification
The PikeVM verification is the bottleneck - it must execute for every candidate match.
Rust regex approach (DFA State Acceleration)
From Rust regex internals:
- Full DFA compiled for the pattern
- For
.*states, most transitions are self-transitions (all bytes except\n) - Rust marks these as "accelerated states"
- Uses
memchr('\n')to skip directly to next line - Continues DFA execution from there
Key insight from regex-automata/src/dfa/special.rs:
accelerated - A state where all of its outgoing transitions, except a
few, loop back to itself. These states are candidates for acceleration
via memchr during search.
Proposed Solution
Option A: DFA Acceleration (like Rust)
Implement DFA state acceleration:
- Detect states with mostly self-transitions
- Store the "escape bytes" (bytes that don't self-loop)
- During search, use
memchrto find escape bytes - Skip directly to next potential state change
Pros: Works for any pattern with repetition
Cons: Requires significant DFA infrastructure changes
Option B: Specialized Multiline Line Scanner
For (?m)^prefix.*suffix patterns:
- Build a specialized line scanner
- Use
memchr('\n')to iterate lines - For each line, check prefix/suffix with simple byte matching
- Only invoke full engine if prefix/suffix match
Pros: Simpler to implement
Cons: Only works for specific pattern shapes
Option C: Hybrid DFA for .* sequences
When pattern contains .*:
- Extract the literal prefix/suffix around
.* - Build mini-DFA for prefix matching
- Use
memchr('\n')to find line boundaries - Match prefix at line start, suffix before
\n
Benchmark Reference
# regex-bench CI (GitHub Actions Ubuntu)
multiline_php 93 ms (stdlib) → 66 ms (coregex) → 0.79 ms (Rust)
Related
- Issue perf: (?m)^/.*[\w-]+\.php multiline+wildcard 24% slower than stdlib #97 (UseMultilineReverseSuffix - partial fix)
- Rust regex: https://github.com/rust-lang/regex
- Blog: https://burntsushi.net/regex-internals/
Metadata
Metadata
Assignees
Labels
No labels