@kolkov kolkov commented Jan 14, 2026

Summary

v0.10.7 delivers UTF-8/Unicode fixes and complete stdlib API compatibility.

UTF-8/Unicode Fixes

100% stdlib API Compatibility

All stdlib regexp methods implemented:

  • CompilePOSIX, MustCompilePOSIX - POSIX ERE semantics
  • Match, MatchString, MatchReader - package-level functions
  • SubexpIndex(name) - named capture group lookup
  • LiteralPrefix() - literal prefix extraction with anchor handling
  • Expand, ExpandString - template substitution ($0, $1, etc.)
  • Copy() - regex duplication (deprecated since Go 1.12)
  • MarshalText, UnmarshalText - encoding.TextMarshaler and encoding.TextUnmarshaler interfaces
  • MatchReader, FindReaderIndex, FindReaderSubmatchIndex - io.RuneReader methods
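As a reference point for what "compatible" means here, the sketch below exercises a few of the listed methods against Go's standard `regexp` package. It assumes only the stdlib; the pattern, the `expandVersion` and `roundTrip` helpers, and the sample input are illustrative and do not come from this PR. A drop-in replacement would be expected to produce the same results.

```go
package main

import (
	"fmt"
	"regexp"
)

// versionRe uses named groups so SubexpIndex and ${name} templates apply.
// The pattern is an illustrative example, not taken from the PR.
var versionRe = regexp.MustCompile(`v(?P<major>\d+)\.(?P<minor>\d+)`)

// expandVersion rewrites the first version found in s using a template.
// It returns "" when s contains no match.
func expandVersion(s string) string {
	loc := versionRe.FindStringSubmatchIndex(s)
	if loc == nil {
		return ""
	}
	return string(versionRe.ExpandString(nil, "major=${major} minor=$2", s, loc))
}

// roundTrip serializes the pattern with MarshalText and rebuilds it with
// UnmarshalText (both require Go 1.21 or newer in the standard library).
func roundTrip(re *regexp.Regexp) (string, error) {
	text, err := re.MarshalText()
	if err != nil {
		return "", err
	}
	var restored regexp.Regexp
	if err := restored.UnmarshalText(text); err != nil {
		return "", err
	}
	return restored.String(), nil
}

func main() {
	// Index of the named group "minor" within the submatch slice.
	fmt.Println(versionRe.SubexpIndex("minor"))

	// Literal string every match must start with, and whether that
	// literal is the whole pattern.
	prefix, complete := versionRe.LiteralPrefix()
	fmt.Println(prefix, complete)

	fmt.Println(expandVersion("release v1.21"))
	fmt.Println(roundTrip(versionRe))
}
```

Running the same calls against both engines and comparing the output is a quick manual check of the compatibility claim.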

Changes

  • nfa/compile.go: UTF-8 range compilation for Unicode classes
  • regex.go: +288 lines - all missing stdlib API methods
  • regex_stdlib_compat_test.go: New - 16 compatibility tests
  • docs/STDLIB_COMPATIBILITY.md: Updated to 100% API coverage
  • CHANGELOG.md, ROADMAP.md: v0.10.7 documentation

Test plan

  • All existing tests pass (go test ./...)
  • New stdlib compatibility tests pass
  • Linter passes (0 issues)
  • UTF-8 regression tests pass
  • Fuzz tests against stdlib pass
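The idea behind testing against stdlib can be sketched as a differential check: run the same pattern and input through the standard library and through the engine under test, then compare. The `finder` type, `agrees`, and the deliberately buggy `byteDot` candidate below are hypothetical names for illustration only; the PR's actual fuzz harness is not shown in this conversation.

```go
package main

import (
	"fmt"
	"regexp"
)

// finder is a hypothetical stand-in for any engine's "find first match" call.
type finder func(pattern, input string) ([]int, error)

// stdlibFind is the reference implementation backed by Go's regexp package.
func stdlibFind(pattern, input string) ([]int, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	return re.FindStringIndex(input), nil
}

// byteDot mimics the kind of bug described in this PR: for the pattern "."
// it matches a single byte instead of a whole code point. Every other
// pattern is delegated to the standard library.
func byteDot(pattern, input string) ([]int, error) {
	if pattern == "." && len(input) > 0 {
		return []int{0, 1}, nil
	}
	return stdlibFind(pattern, input)
}

// agrees reports whether candidate returns the same result as the stdlib.
func agrees(candidate finder, pattern, input string) bool {
	want, wantErr := stdlibFind(pattern, input)
	got, gotErr := candidate(pattern, input)
	if (wantErr != nil) != (gotErr != nil) {
		return false
	}
	if wantErr != nil {
		return true // both rejected the pattern
	}
	if len(want) != len(got) {
		return false
	}
	for i := range want {
		if want[i] != got[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(agrees(byteDot, ".", "a"))  // single-byte input hides the bug
	fmt.Println(agrees(byteDot, ".", "中")) // multi-byte input exposes it
}
```

In a real fuzz test the pattern and input would come from the fuzzer rather than fixed strings, and `agrees` would be the property being asserted.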

Fixes

Closes #85, #87, #88, #90, #91

- Fixed '.' to match UTF-8 codepoints instead of bytes
- Implemented compileUTF8Any() for proper 1-4 byte UTF-8 sequences
- Fixed NumSubexp() to match stdlib behavior (exclude group 0)
- Added comprehensive stdlib compatibility tests (1800+ lines)
- Added regression tests for UTF-8 dot matching
- Added fuzz tests comparing against stdlib
- Updated STDLIB_COMPATIBILITY.md documentation

Fixes #85
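The target behavior for the two fixes above can be checked against Go's standard `regexp` package. This is a minimal sketch that uses only the stdlib; `dotMatches` and `groupCount` are illustrative helper names, not functions from this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// dotMatches returns every match of `.` in s.
func dotMatches(s string) []string {
	return regexp.MustCompile(`.`).FindAllString(s, -1)
}

// groupCount returns the number of parenthesized groups in pattern,
// which does not count the implicit whole-match group 0.
func groupCount(pattern string) int {
	return regexp.MustCompile(pattern).NumSubexp()
}

func main() {
	// "中" occupies 3 bytes in UTF-8 but is one code point, so `.` should
	// produce one match for it rather than three.
	fmt.Println(len("中"), dotMatches("a中b"))

	// Two capture groups; group 0 is excluded from the count.
	fmt.Println(groupCount(`(a)(b)`))
}
```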

github-actions bot commented Jan 14, 2026

Benchmark Comparison

Comparing main → PR #86

Summary: geomean 243.4n 252.1n +3.55%

⚠️ Potential regressions detected:

Accelerate/memchr1-4                           108.1n ± ∞ ¹   120.6n ± ∞ ¹  +11.56% (p=0.008 n=5)
Accelerate/memchr2-4                           209.6n ± ∞ ¹   232.4n ± ∞ ¹  +10.88% (p=0.008 n=5)
geomean                                        243.4n         252.1n         +3.55%
AhoCorasickLargeInput/coregex_Find_64KB-4      111.3µ ± ∞ ¹   111.6µ ± ∞ ¹   +0.29% (p=0.032 n=5)
AhoCorasickManyPatterns/stdlib_25_patterns-4   146.1n ± ∞ ¹   146.9n ± ∞ ¹   +0.55% (p=0.008 n=5)
BranchDispatch_Stdlib/Digits-4                 133.8n ± ∞ ¹   135.3n ± ∞ ¹   +1.12% (p=0.024 n=5)

Full results available in workflow artifacts. CI runners have ~10-20% variance.
For accurate benchmarks, run locally: ./scripts/bench.sh --compare

- Fix case-insensitive literal extraction (#87): skip FoldCase patterns in prefilter
- Fix empty character class matching (#88): compileNoMatch() for [^\S\s]
- Fix empty pattern Split (#90): handle zero-width matches at boundaries
- Add regression tests for all fixes

Fixes #87, #88, #90

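For reference, the stdlib behavior these three fixes aim to reproduce can be observed with the short program below. It assumes only Go's standard `regexp` package; the helper names and inputs are illustrative and are not the PR's regression tests.

```go
package main

import (
	"fmt"
	"regexp"
)

// foldMatch reports whether a case-insensitive pattern matches s.
func foldMatch(word, s string) bool {
	return regexp.MustCompile(`(?i)` + regexp.QuoteMeta(word)).MatchString(s)
}

// emptyClassMatches reports whether the empty class [^\S\s] matches s.
// The class excludes both whitespace and non-whitespace, so nothing is left.
func emptyClassMatches(s string) bool {
	return regexp.MustCompile(`[^\S\s]`).MatchString(s)
}

// splitEmpty splits s on the empty pattern, which matches at every position.
func splitEmpty(s string) []string {
	return regexp.MustCompile(``).Split(s, -1)
}

func main() {
	fmt.Println(foldMatch("hello", "Say HELLO")) // case-insensitive literal
	fmt.Println(emptyClassMatches("any text"))   // empty character class
	fmt.Printf("%q\n", splitEmpty("abc"))        // zero-width matches at boundaries
}
```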
- Replace incorrect single-byte matching with proper UTF-8 sequences
- Build UTF-8 automata for each Unicode range (1/2/3/4-byte)
- Handle surrogate gap (U+D800-U+DFFF) correctly
- Optimize: use efficient 'any UTF-8' for simple negated ASCII classes
- Add comprehensive tests for \P{Han}, \P{Latin}, etc.

Before: \P{Han} on 中 returned 3 matches (one per byte)
After: \P{Han} on 中 returns 0 matches (correct)

- Add v0.10.7 section with all fixes (#85, #87, #88, #90, #91)
- Update STDLIB_COMPATIBILITY: remove fixed issues from known differences
- Reduce known differences from 7 to 3 (4 fixed in this release)
- Update version compatibility table
- Update current version to v0.10.7
- Add v0.10.6 and v0.10.7 to release history
- Mark UTF-8 Automata Optimization as partial (OPT-014)
- Update release strategy flow
- Rename 'Fully Compatible' to 'Implemented Functions'
- Add 'Not Yet Implemented' section with planned methods
- Update compatibility level to '90%+' to be accurate
- Reader-based methods, Expand, SubexpIndex, and serialization are not yet implemented

Implement all remaining stdlib regexp methods:
- CompilePOSIX, MustCompilePOSIX (POSIX ERE semantics)
- Package-level Match, MatchString, MatchReader
- SubexpIndex(name) for named capture groups
- LiteralPrefix() with anchor handling
- Expand, ExpandString template substitution
- Copy() (deprecated since Go 1.12)
- MarshalText, UnmarshalText (encoding.TextMarshaler, encoding.TextUnmarshaler)
- MatchReader, FindReaderIndex, FindReaderSubmatchIndex

Add comprehensive stdlib compatibility tests.
Update documentation to reflect 100% API coverage.
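Two of the items in this commit have semantics that are easy to get subtly wrong: POSIX leftmost-longest matching and the io.RuneReader-based methods. The sketch below shows the stdlib behavior a compatible implementation would need to reproduce. It uses only Go's standard library; `firstMatch` and `readerIndex` are illustrative helpers, not part of this PR.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// firstMatch returns the first match of pattern in s, compiled with either
// Perl-style (leftmost-first) or POSIX (leftmost-longest) semantics.
func firstMatch(posix bool, pattern, s string) string {
	if posix {
		return regexp.MustCompilePOSIX(pattern).FindString(s)
	}
	return regexp.MustCompile(pattern).FindString(s)
}

// readerIndex returns the location of the first match when the input is
// consumed through an io.RuneReader instead of a string.
func readerIndex(pattern, s string) []int {
	return regexp.MustCompile(pattern).FindReaderIndex(strings.NewReader(s))
}

func main() {
	// With alternation, the two modes pick different matches.
	fmt.Println(firstMatch(false, `a|ab`, "ab")) // leftmost-first
	fmt.Println(firstMatch(true, `a|ab`, "ab"))  // leftmost-longest

	// Reader-based search reports byte offsets into the stream.
	fmt.Println(readerIndex(`b+`, "aabbb"))

	// Package-level MatchReader compiles the pattern and searches in one call.
	ok, err := regexp.MatchReader(`b+`, strings.NewReader("aabbb"))
	fmt.Println(ok, err)
}
```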
@kolkov kolkov changed the title fix: UTF-8 codepoint matching for dot metacharacter (#85) feat: v0.10.7/v0.10.8 - UTF-8 fixes + 100% stdlib API compatibility Jan 15, 2026
@kolkov kolkov force-pushed the feature/utf8-dot-fix-85 branch from db7a91a to b617111 Compare January 15, 2026 00:08
@kolkov kolkov changed the title feat: v0.10.7/v0.10.8 - UTF-8 fixes + 100% stdlib API compatibility feat: v0.10.7 - UTF-8 fixes + 100% stdlib API compatibility Jan 15, 2026
@kolkov kolkov merged commit 5549ba0 into main Jan 15, 2026
15 checks passed
@kolkov kolkov deleted the feature/utf8-dot-fix-85 branch January 15, 2026 00:11

Development

Successfully merging this pull request may close these issues.

bug: Dot metacharacter matches bytes instead of UTF-8 codepoints

2 participants