perf: (?m)^/.*[\w-]+\.php multiline+wildcard 24% slower than stdlib #97

@kolkov

Problem

The pattern (?m)^/.*[\w-]+\.php (multiline-anchored with a wildcard) runs 24% slower in coregex than in the Go stdlib regexp package.

Benchmark Results (regex-bench, 6MB input)

| Engine | Time | Matches |
|---|---|---|
| Go stdlib | 124 ms | 1966 |
| Go coregex | 153 ms | 1966 |

Regression: coregex is 24% slower (153ms vs 124ms)

Root Cause

The UseAnchoredLiteral strategy is optimized for single-string matching (^prefix.*suffix$), not multiline (?m) mode.

With (?m) flag:

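  • ^ matches at position 0 AND immediately after every newline
  • Every line start in the input is therefore a possible match start, so the whole input has to be scanned for newline positions
  • The current UseAnchoredLiteral strategy has no fast path for this case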
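To make the difference concrete, here is a minimal sketch using only the Go stdlib regexp package. The helper name matchStarts and the sample input are illustrative assumptions and are not part of coregex.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchStarts is a hypothetical helper that returns the start offset of
// every match of pattern in input.
func matchStarts(pattern, input string) []int {
	re := regexp.MustCompile(pattern)
	var starts []int
	for _, loc := range re.FindAllStringIndex(input, -1) {
		starts = append(starts, loc[0])
	}
	return starts
}

func main() {
	input := "/a.php\nskip\n/b.php\n"

	// Without (?m), ^ matches only at offset 0.
	fmt.Println(matchStarts(`^/`, input)) // [0]

	// With (?m), ^ also matches right after each newline.
	fmt.Println(matchStarts(`(?m)^/`, input)) // [0 12]
}
```

With (?m) the number of candidate start positions grows with the number of lines, which is why a strategy tuned for a single anchor at offset 0 does not carry over directly.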

Expected Behavior

Multiline anchored patterns should be at least as fast as stdlib, ideally faster using:

  1. Suffix prefilter to find .php candidates
  2. Backward verification to line start (find preceding \n)
  3. Forward match from line start

Proposed Solution

New strategy UseMultilineAnchoredLiteral:

  1. Extract suffix literal (.php)
  2. Use memmem to find all .php occurrences
  3. For each candidate, scan backward to find \n or start of input
  4. Verify prefix from line start

Related

  • Issue #79 ("noticeably slower than core regexp"): UseAnchoredLiteral for single-string matching (32-133x speedup) ✅
  • http_methods pattern (?m)^(GET|POST|...) is 50x faster (uses Teddy prefilter for literal alternation)

Test Case

re := coregex.MustCompile(`(?m)^/.*[\w-]+\.php`)
// Input: 6MB text with ~2000 lines starting with /path/to/file.php
matches := re.FindAll(data, -1) // Should find 1966 matches

Benchmark Command

cd regex-bench && make build && make run
# Look for http_methods and anchored_php patterns
