-
Notifications
You must be signed in to change notification settings - Fork 4
Closed
Labels
bugSomething isn't workingSomething isn't working
Description
Description
The . metacharacter incorrectly matches individual bytes instead of UTF-8 codepoints on multibyte input.
Reproduction
package main
import (
"fmt"
"regexp"
"github.com/coregx/coregex"
)
func main() {
input := "日本語" // 9 bytes, 3 codepoints
stdRe := regexp.MustCompile(`.`)
cgRe := coregex.MustCompile(`.`)
fmt.Printf("stdlib: %d matches %v\n", len(stdRe.FindAllString(input, -1)), stdRe.FindAllString(input, -1))
fmt.Printf("coregex: %d matches %v\n", len(cgRe.FindAllString(input, -1)), cgRe.FindAllString(input, -1))
}Expected (stdlib)
stdlib: 3 matches [日 本 語]
Actual (coregex)
coregex: 9 matches [� � � � � � � � �]
Impact
Critical - Affects ALL patterns using . on non-ASCII text:
.+,.*,.?(.)capture groups- Any pattern with dot metacharacter
Affected Patterns
.- single dot.+,.*,.?- dot with quantifiers(.)- dot in capture group\D,\W,\S- may also be affected (negated classes)
Root Cause
Likely in NFA/DFA handling of OpAnyCharNotNL - matching byte-by-byte instead of rune-by-rune.
Priority
🔴 Critical - Breaks Unicode compatibility, blocking v1.0.0
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working