Fix UTF8 escapes in character classes #106

Merged
jaynetics merged 1 commit into ammar:master from Earlopain:fix-utf8-escapes-in-sets on Sep 15, 2025

Conversation

@Earlopain

Contributor

Closes #104, I think I found the fix myself 💪

Otherwise one character gets split up into its individual bytes.
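
As a rough illustration of the symptom described above (not the scanner's actual code), the following Ruby sketch shows what "split up into its individual bytes" means for a multibyte UTF-8 character. It uses only core String methods; the choice of "ü" and the variable names are assumptions made for this example.

```ruby
# Illustration only: "ü" (U+00FC) stands in for any multibyte character
# that might appear escaped inside a character class, e.g. /[\ü]/.
char = "\u00FC"

# Seen as text, this is one character.
chars = char.chars

# Seen as raw bytes, it is two.
bytes = char.bytes # => [195, 188]

# If each byte were emitted as its own piece, neither piece would be
# valid UTF-8 on its own.
fragments = char.b.chars.map { |byte| byte.force_encoding("UTF-8") }
fragment_validity = fragments.map(&:valid_encoding?) # => [false, false]
```

A scanner that matches byte by byte and has no rule for the whole multibyte sequence would see the second view, which is consistent with the extra `utf8_multibyte` alternative added in this PR.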

Comment thread lib/regexp_parser/scanner/scanner.rl Outdated
  # (This currently includes \^, \-, \&, \:, although these could potentially
  # be meta chars when not escaped, depending on their position in the set.)
- any > (escaped_set_alpha, 1) {
+ any > (escaped_set_alpha, 1) | utf8_multibyte > (escaped_alpha, 1) {

@Earlopain commented on Aug 27, 2025

Contributor Author

The simpler form below also works, and no tests fail with it. I copied the `> (..., 1)` part over to utf8_multibyte for consistency, but I'm not really sure what it is supposed to do:

Suggested change
- any > (escaped_set_alpha, 1) | utf8_multibyte > (escaped_alpha, 1) {
+ any | utf8_multibyte {

In fact, Ragel doesn't seem to produce different output with or without it.

@Earlopain force-pushed the fix-utf8-escapes-in-sets branch from bc5f0e4 to 2efa904 on August 27, 2025 14:24
@jaynetics merged commit a611c88 into ammar:master on Sep 15, 2025
12 checks passed
Development

Successfully merging this pull request may close these issues.

UTF8-escaped characters mess up locations when they are part of a character class