Skip to content

bug: Dot metacharacter matches bytes instead of UTF-8 codepoints #85

@kolkov

Description

@kolkov

Description

The . metacharacter incorrectly matches individual bytes instead of UTF-8 codepoints on multibyte input.

Reproduction

package main

import (
    "fmt"
    "regexp"
    "github.com/coregx/coregex"
)

func main() {
    input := "日本語"  // 9 bytes, 3 codepoints
    
    stdRe := regexp.MustCompile(`.`)
    cgRe := coregex.MustCompile(`.`)
    
    fmt.Printf("stdlib:  %d matches %v\n", len(stdRe.FindAllString(input, -1)), stdRe.FindAllString(input, -1))
    fmt.Printf("coregex: %d matches %v\n", len(cgRe.FindAllString(input, -1)), cgRe.FindAllString(input, -1))
}

Expected (stdlib)

stdlib:  3 matches [日 本 語]

Actual (coregex)

coregex: 9 matches [� � � � � � � � �]

Impact

Critical - Affects ALL patterns using . on non-ASCII text:

  • .+, .*, .?
  • (.) capture groups
  • Any pattern with dot metacharacter

Affected Patterns

  • . - single dot
  • .+, .*, .? - dot with quantifiers
  • (.) - dot in capture group
  • \D, \W, \S - may also be affected (negated classes)

Root Cause

Likely in NFA/DFA handling of OpAnyCharNotNL - matching byte-by-byte instead of rune-by-rune.

Priority

🔴 Critical - Breaks Unicode compatibility, blocking v1.0.0

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions