Improve Unicode handling in code-frame tokenizer #17589

Merged
JLHwung merged 4 commits into babel:main from JLHwung:enable-prefer-string-starts-ends-with
Nov 14, 2025

JLHwung commented Nov 7, 2025

Q                         A
Fixed Issues?             @babel/code-frame does not correctly tokenize non-BMP capitalized identifiers.
Patch: Bug Fix?
Major: Breaking Change?
Minor: New Feature?
Tests Added + Pass?       Yes
Documentation PR Link
Any Dependency Changes?
License                   MIT

In this PR we enable the prefer-string-starts-ends-with ts-eslint rule and fix most of the resulting lint errors.

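Then we improve Unicode handling in the code-frame tokenizer. Previously we only tested whether string[0] equals string[0].toLowerCase(). This approach does not respect non-BMP characters: string[0] returns a single UTF-16 code unit, which for a non-BMP character is a lone surrogate with no case mapping.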
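To illustrate the failure mode described above, here is a minimal sketch. The helper names are hypothetical and this is not the actual @babel/code-frame source; it only contrasts a code-unit comparison with a code-point comparison, using U+10400 (DESERET CAPITAL LETTER LONG I) as an example of a non-BMP capital letter.

```typescript
// Hypothetical helpers for illustration only; not the @babel/code-frame implementation.

// Old-style check: looks at the first UTF-16 code unit only.
// value.charAt(0) is equivalent to value[0] for a non-empty string.
function isCapitalizedByCodeUnit(value: string): boolean {
  if (value.length === 0) return false;
  const first = value.charAt(0);
  return first !== first.toLowerCase();
}

// Code-point-aware check: reads the whole first code point,
// which may span two code units (a surrogate pair).
function isCapitalizedByCodePoint(value: string): boolean {
  const codePoint = value.codePointAt(0);
  if (codePoint === undefined) return false;
  const first = String.fromCodePoint(codePoint);
  return first !== first.toLowerCase();
}

// U+10400 lowercases to U+10428, so this identifier starts with a capital letter.
const deseretIdentifier = "\u{10400}bc";

// The code-unit check sees only the lone high surrogate and reports false.
const codeUnitResult = isCapitalizedByCodeUnit(deseretIdentifier);
// The code-point check sees the full character and reports true.
const codePointResult = isCapitalizedByCodePoint(deseretIdentifier);
```

For BMP identifiers such as `Foo` or `foo`, both helpers agree; they differ only when the first character lies outside the BMP.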

JLHwung added the "PR: Bug Fix 🐛" label (A type of pull request used for our changelog categories) Nov 7, 2025

babel-bot commented Nov 7, 2025

Build successful! You can test your changes in the REPL here: https://babeljs.io/repl/build/60164

pkg-pr-new bot commented Nov 7, 2025

commit: a98ee72

@ehoogeveen-medweb

Happened upon this and started wondering if you ever need to look at it on a grapheme cluster level to decide the case. Thankfully it seems that in practice, looking at the first code point is always enough.

Another O(1) way to get the first code point in a string is tokenValue[Symbol.iterator]().next().value, but aside from being very ugly I'm not sure it's faster either.
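As a small sketch of the two approaches mentioned in this comment (the function names are hypothetical), the string iterator yields the first code point as a string, while `codePointAt(0)` returns it as a number that can be converted back with `String.fromCodePoint`. No claim is made here about which one is faster.

```typescript
// Hypothetical helpers for illustration only.

// The string iterator steps by code point, so the first result is the
// whole first character even when it is a surrogate pair.
function firstCodePointViaIterator(value: string): string | undefined {
  return value[Symbol.iterator]().next().value;
}

// codePointAt(0) returns the numeric code point, or undefined for an empty string.
function firstCodePointViaCodePointAt(value: string): string | undefined {
  const codePoint = value.codePointAt(0);
  return codePoint === undefined ? undefined : String.fromCodePoint(codePoint);
}

const sampleToken = "\u{10400}bc";
const fromIterator = firstCodePointViaIterator(sampleToken);
const fromCodePointAt = firstCodePointViaCodePointAt(sampleToken);
```

Both helpers return the same two-code-unit string for `sampleToken`, and `undefined` for an empty string.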

JLHwung commented Nov 9, 2025

> Happened upon this and started wondering if you ever need to look at it on a grapheme cluster level to decide the case. Thankfully it seems that in practice, looking at the first code point is always enough.
>
> Another O(1) way to get the first code point in a string is tokenValue[Symbol.iterator]().next().value, but aside from being very ugly I'm not sure it's faster either.

Thank you. I think implementing UAX 29 might be overkill, as the cluster boundary rules mostly concern Hangul, Arabic or other scripts using ZWJ, and emoji. None of them has a concept of case.

The performance concern is a good point, since identifier handling is a hot path. I will add a fast path for ASCII characters.
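One way such an ASCII fast path could look is sketched below. This is an assumption about the general shape, not the code merged in this PR, and the function name is hypothetical: ASCII first characters are classified with a numeric range check, and only non-ASCII input falls back to the code-point comparison.

```typescript
// Hypothetical sketch of an ASCII fast path; the name and structure are
// assumptions, not the code merged in this PR.
function startsWithUpperCaseFast(value: string): boolean {
  if (value.length === 0) return false;

  const firstUnit = value.charCodeAt(0);
  if (firstUnit < 0x80) {
    // ASCII: only A-Z (0x41-0x5A) are uppercase.
    // `$`, `_`, digits and a-z all fall outside this range.
    return firstUnit >= 0x41 && firstUnit <= 0x5a;
  }

  // Non-ASCII: compare the full first code point with its lowercase form.
  const codePoint = value.codePointAt(0);
  if (codePoint === undefined) return false;
  const first = String.fromCodePoint(codePoint);
  return first !== first.toLowerCase();
}
```

The fast path avoids allocating a new string for the common case of ASCII identifiers, while non-BMP capitals such as U+10400 still go through the code-point-aware branch.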

JLHwung requested a review from liuxingbaoyu November 13, 2025 15:04
JLHwung merged commit c92c491 into babel:main Nov 14, 2025
74 checks passed
JLHwung deleted the enable-prefer-string-starts-ends-with branch November 14, 2025 13:03
github-actions bot added the "outdated" label (A closed issue/PR that is archived due to age. Recommended to make a new issue) Feb 14, 2026
github-actions bot locked as resolved and limited conversation to collaborators Feb 14, 2026
