perf: Multiline patterns 84x slower than Rust regex

## Problem

Pattern `(?m)^/.*\.php` on 6MB input:

| Engine | Time | 
|--------|------|
| Go stdlib | 93 ms |
| **coregex v0.11.1** | 66 ms |
| **Rust regex** | **0.79 ms** |

Rust is **84x faster** than coregex on this pattern.

## Root Cause Analysis

### Current coregex approach (UseMultilineReverseSuffix)

1. Suffix prefilter finds `.php` candidates
2. Backward scan to line start (`\n`)
3. Forward **PikeVM** verification

The PikeVM verification is the bottleneck - it must execute for every candidate match.

### Rust regex approach (DFA State Acceleration)

From [Rust regex internals](https://burntsushi.net/regex-internals/):

1. Full DFA compiled for the pattern
2. For `.*` states, most transitions are **self-transitions** (all bytes except `\n`)
3. Rust marks these as **"accelerated states"**
4. Uses `memchr('\n')` to skip directly to next line
5. Continues DFA execution from there

Key insight from `regex-automata/src/dfa/special.rs`:
```
accelerated - A state where all of its outgoing transitions, except a
few, loop back to itself. These states are candidates for acceleration
via memchr during search.
```

## Proposed Solution

### Option A: DFA Acceleration (like Rust)

Implement DFA state acceleration:
1. Detect states with mostly self-transitions
2. Store the "escape bytes" (bytes that don't self-loop)
3. During search, use `memchr` to find escape bytes
4. Skip directly to next potential state change

**Pros**: Works for any pattern with repetition
**Cons**: Requires significant DFA infrastructure changes

### Option B: Specialized Multiline Line Scanner

For `(?m)^prefix.*suffix` patterns:
1. Build a specialized line scanner
2. Use `memchr('\n')` to iterate lines
3. For each line, check prefix/suffix with simple byte matching
4. Only invoke full engine if prefix/suffix match

**Pros**: Simpler to implement
**Cons**: Only works for specific pattern shapes

### Option C: Hybrid DFA for `.*` sequences

When pattern contains `.*`:
1. Extract the literal prefix/suffix around `.*`
2. Build mini-DFA for prefix matching
3. Use `memchr('\n')` to find line boundaries
4. Match prefix at line start, suffix before `\n`

## Benchmark Reference

```
# regex-bench CI (GitHub Actions Ubuntu)
multiline_php    93 ms (stdlib)  →  66 ms (coregex)  →  0.79 ms (Rust)
```

## Related

- Issue #97 (UseMultilineReverseSuffix - partial fix)
- Rust regex: https://github.com/rust-lang/regex
- Blog: https://burntsushi.net/regex-internals/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: Multiline patterns 84x slower than Rust regex #99

Problem

Root Cause Analysis

Current coregex approach (UseMultilineReverseSuffix)

Rust regex approach (DFA State Acceleration)

Proposed Solution

Option A: DFA Acceleration (like Rust)

Option B: Specialized Multiline Line Scanner

Option C: Hybrid DFA for `.*` sequences

Benchmark Reference

Related

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Engine	Time
Go stdlib	93 ms
coregex v0.11.1	66 ms
Rust regex	0.79 ms

perf: Multiline patterns 84x slower than Rust regex #99

Description

Problem

Root Cause Analysis

Current coregex approach (UseMultilineReverseSuffix)

Rust regex approach (DFA State Acceleration)

Proposed Solution

Option A: DFA Acceleration (like Rust)

Option B: Specialized Multiline Line Scanner

Option C: Hybrid DFA for .* sequences

Benchmark Reference

Related

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Option C: Hybrid DFA for `.*` sequences