feat: v0.10.7 - UTF-8 fixes + 100% stdlib API compatibility #86

kolkov · 2026-01-14T22:41:56Z

Summary

v0.10.7 delivers UTF-8/Unicode fixes and complete stdlib API compatibility.

UTF-8/Unicode Fixes

#85 "bug: Dot metacharacter matches bytes instead of UTF-8 codepoints": Dot metacharacter now matches UTF-8 codepoints (not bytes)
#87 "Case-insensitive patterns don't match lowercase when unanchored": Case-insensitive patterns skip literal prefilters correctly
#88 "Empty character class [^\S\s] matches instead of failing": Empty character classes ([^\S\s]) no longer match empty strings
#90 "Split with empty pattern returns extra empty strings": Empty pattern Split matches stdlib behavior
#91 "Negated Unicode property class \P{Han}+ matches incorrectly": Negated Unicode property classes (\P{Han}) use proper UTF-8 automata

100% stdlib API Compatibility

All stdlib regexp methods implemented:

CompilePOSIX, MustCompilePOSIX - POSIX ERE semantics
Match, MatchString, MatchReader - package-level functions
SubexpIndex(name) - named capture group lookup
LiteralPrefix() - literal prefix extraction with anchor handling
Expand, ExpandString - template substitution ($0, $1, etc.)
Copy() - regex duplication (deprecated since Go 1.12)
MarshalText, UnmarshalText - encoding.TextMarshaler and encoding.TextUnmarshaler interfaces
MatchReader, FindReaderIndex, FindReaderSubmatchIndex - io.RuneReader methods
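
As a rough illustration of what several of these methods do, here is a minimal sketch written against the standard library `regexp` package, whose behavior they are meant to mirror. The helpers `reformatDate` and `roundTrip` are hypothetical names introduced for this example. `MarshalText` and `UnmarshalText` require Go 1.21 or later, and `SubexpIndex` requires Go 1.15 or later.

```go
package main

import (
	"fmt"
	"regexp"
)

var dateRe = regexp.MustCompile(`(?P<year>\d{4})-(?P<month>\d{2})`)

// reformatDate rewrites the first YYYY-MM occurrence in src as MM/YYYY
// by expanding a template that refers to the named capture groups.
func reformatDate(src string) string {
	loc := dateRe.FindStringSubmatchIndex(src)
	if loc == nil {
		return ""
	}
	return string(dateRe.ExpandString(nil, "$month/$year", src, loc))
}

// roundTrip marshals a compiled pattern to text and compiles it back.
func roundTrip(pattern string) (string, error) {
	text, err := regexp.MustCompile(pattern).MarshalText()
	if err != nil {
		return "", err
	}
	var re regexp.Regexp
	if err := re.UnmarshalText(text); err != nil {
		return "", err
	}
	return re.String(), nil
}

func main() {
	fmt.Println(dateRe.SubexpIndex("month"))         // 2
	fmt.Println(dateRe.NumSubexp())                  // 2 (group 0 is not counted)
	fmt.Println(reformatDate("released 2026-01-14")) // 01/2026

	// LiteralPrefix reports the literal that every match must start with,
	// and whether that literal is the whole pattern.
	prefix, complete := regexp.MustCompile(`foo.*`).LiteralPrefix()
	fmt.Println(prefix, complete) // foo false

	fmt.Println(roundTrip(`a+b`))
}
```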

Changes

nfa/compile.go: UTF-8 range compilation for Unicode classes
regex.go: +288 lines - all missing stdlib API methods
regex_stdlib_compat_test.go: New - 16 compatibility tests
docs/STDLIB_COMPATIBILITY.md: Updated to 100% API coverage
CHANGELOG.md, ROADMAP.md: v0.10.7 documentation

Test plan

All existing tests pass (go test ./...)
New stdlib compatibility tests pass
Linter passes (0 issues)
UTF-8 regression tests pass
Fuzz tests against stdlib pass

Fixes

Closes #85, #87, #88, #90, #91

- Fixed '.' to match UTF-8 codepoints instead of bytes
- Implemented compileUTF8Any() for proper 1-4 byte UTF-8 sequences
- Fixed NumSubexp() to match stdlib behavior (exclude group 0)
- Added comprehensive stdlib compatibility tests (1800+ lines)
- Added regression tests for UTF-8 dot matching
- Added fuzz tests comparing against stdlib
- Updated STDLIB_COMPATIBILITY.md documentation

Fixes #85

github-actions · 2026-01-14T22:46:50Z

Benchmark Comparison

Comparing main → PR #86

Summary: geomean 243.4n 252.1n +3.55%

⚠️ Potential regressions detected:

Accelerate/memchr1-4       108.1n ± ∞ ¹   120.6n ± ∞ ¹  +11.56% (p=0.008 n=5)
Accelerate/memchr2-4       209.6n ± ∞ ¹   232.4n ± ∞ ¹  +10.88% (p=0.008 n=5)
AhoCorasickLargeInput/coregex_Find_64KB-4               111.3µ ± ∞ ¹    111.6µ ± ∞ ¹   +0.29% (p=0.032 n=5)
AhoCorasickManyPatterns/stdlib_25_patterns-4            146.1n ± ∞ ¹    146.9n ± ∞ ¹   +0.55% (p=0.008 n=5)
BranchDispatch_Stdlib/Digits-4                          133.8n ± ∞ ¹    135.3n ± ∞ ¹   +1.12% (p=0.024 n=5)

Full results available in workflow artifacts. CI runners have ~10-20% variance.
For accurate benchmarks, run locally: ./scripts/bench.sh --compare

- Fix case-insensitive literal extraction (#87): skip FoldCase patterns in prefilter
- Fix empty character class matching (#88): compileNoMatch() for [^\S\s]
- Fix empty pattern Split (#90): handle zero-width matches at boundaries
- Add regression tests for all fixes

Fixes #87, #88, #90

- Replace incorrect single-byte matching with proper UTF-8 sequences
- Build UTF-8 automata for each Unicode range (1/2/3/4-byte)
- Handle surrogate gap (U+D800-U+DFFF) correctly
- Optimize: use efficient 'any UTF-8' for simple negated ASCII classes
- Add comprehensive tests for \P{Han}, \P{Latin}, etc.

Before: \P{Han} on 中 returned 3 matches (one per byte)
After: \P{Han} on 中 returns 0 matches (correct)

- Add v0.10.7 section with all fixes (#85, #87, #88, #90, #91)
- Update STDLIB_COMPATIBILITY: remove fixed issues from known differences
- Reduce known differences from 7 to 3 (4 fixed in this release)
- Update version compatibility table

- Update current version to v0.10.7
- Add v0.10.6 and v0.10.7 to release history
- Mark UTF-8 Automata Optimization as partial (OPT-014)
- Update release strategy flow

- Rename 'Fully Compatible' to 'Implemented Functions'
- Add 'Not Yet Implemented' section with planned methods
- Update compatibility level to '90%+' to be accurate
- Reader-based methods, Expand, SubexpIndex, and serialization are not yet implemented

Implement all remaining stdlib regexp methods:
- CompilePOSIX, MustCompilePOSIX (POSIX ERE semantics)
- Package-level Match, MatchString, MatchReader
- SubexpIndex(name) for named capture groups
- LiteralPrefix() with anchor handling
- Expand, ExpandString template substitution
- Copy() (deprecated since Go 1.12)
- MarshalText, UnmarshalText (encoding.TextMarshaler)
- MatchReader, FindReaderIndex, FindReaderSubmatchIndex

Add comprehensive stdlib compatibility tests.
Update documentation to reflect 100% API coverage.

kolkov added 6 commits January 15, 2026 02:07

docs: update ROADMAP for v0.10.7 release

0046ab1


kolkov changed the title ~~fix: UTF-8 codepoint matching for dot metacharacter (#85)~~ feat: v0.10.7/v0.10.8 - UTF-8 fixes + 100% stdlib API compatibility Jan 15, 2026

style: format regex_stdlib_compat_test.go

b617111

kolkov force-pushed the feature/utf8-dot-fix-85 branch from db7a91a to b617111 January 15, 2026 00:08

kolkov changed the title ~~feat: v0.10.7/v0.10.8 - UTF-8 fixes + 100% stdlib API compatibility~~ feat: v0.10.7 - UTF-8 fixes + 100% stdlib API compatibility Jan 15, 2026

kolkov merged commit 5549ba0 into main Jan 15, 2026
15 checks passed

kolkov deleted the feature/utf8-dot-fix-85 branch January 15, 2026 00:11
