Skip to content

perf: Multiline patterns 84x slower than Rust regex #99

@kolkov

Description

@kolkov

Problem

Pattern (?m)^/.*\.php on 6MB input:

Engine Time
Go stdlib 93 ms
coregex v0.11.1 66 ms
Rust regex 0.79 ms

Rust is 84x faster than coregex on this pattern.

Root Cause Analysis

Current coregex approach (UseMultilineReverseSuffix)

  1. Suffix prefilter finds .php candidates
  2. Backward scan to line start (\n)
  3. Forward PikeVM verification

The PikeVM verification is the bottleneck - it must execute for every candidate match.

Rust regex approach (DFA State Acceleration)

From Rust regex internals:

  1. Full DFA compiled for the pattern
  2. For .* states, most transitions are self-transitions (all bytes except \n)
  3. Rust marks these as "accelerated states"
  4. Uses memchr('\n') to skip directly to next line
  5. Continues DFA execution from there

Key insight from regex-automata/src/dfa/special.rs:

accelerated - A state where all of its outgoing transitions, except a
few, loop back to itself. These states are candidates for acceleration
via memchr during search.

Proposed Solution

Option A: DFA Acceleration (like Rust)

Implement DFA state acceleration:

  1. Detect states with mostly self-transitions
  2. Store the "escape bytes" (bytes that don't self-loop)
  3. During search, use memchr to find escape bytes
  4. Skip directly to next potential state change

Pros: Works for any pattern with repetition
Cons: Requires significant DFA infrastructure changes

Option B: Specialized Multiline Line Scanner

For (?m)^prefix.*suffix patterns:

  1. Build a specialized line scanner
  2. Use memchr('\n') to iterate lines
  3. For each line, check prefix/suffix with simple byte matching
  4. Only invoke full engine if prefix/suffix match

Pros: Simpler to implement
Cons: Only works for specific pattern shapes

Option C: Hybrid DFA for .* sequences

When pattern contains .*:

  1. Extract the literal prefix/suffix around .*
  2. Build mini-DFA for prefix matching
  3. Use memchr('\n') to find line boundaries
  4. Match prefix at line start, suffix before \n

Benchmark Reference

# regex-bench CI (GitHub Actions Ubuntu)
multiline_php    93 ms (stdlib)  →  66 ms (coregex)  →  0.79 ms (Rust)

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions