Improve Unicode handling in code-frame tokenizer #17589
Build successful! You can test your changes in the REPL here: https://babeljs.io/repl/build/60164
Happened upon this and started wondering whether you ever need to look at the grapheme cluster level to decide the case. Thankfully, it seems that in practice looking at the first code point is always enough. Another O(1) way to get the first code point in a string is
Thank you. I think it would be overkill to implement UAX 29, since the cluster boundary rules mostly matter for Hangul, Arabic and other scripts using ZWJ, and emoji. None of these has a concept of case. The performance concern is a good point, since identifier handling is a hot path. I will add a fast path for ASCII characters.
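The ASCII fast path mentioned above could look roughly like the sketch below. This is an illustration only, not the code from this PR: the function name `startsWithUpperFast` is hypothetical, and it assumes "capitalized" means that the first code point changes under `toLowerCase()`.

```typescript
// Hypothetical helper, not taken from the Babel source.
function startsWithUpperFast(name: string): boolean {
  // First UTF-16 code unit; NaN for an empty string.
  const unit = name.charCodeAt(0);
  if (unit < 0x80) {
    // ASCII fast path: the only uppercase letters are A-Z,
    // so no string allocation or case conversion is needed.
    return unit >= 0x41 && unit <= 0x5a;
  }
  // Slow path: read the full first code point (handles surrogate pairs).
  const cp = name.codePointAt(0);
  if (cp === undefined) return false; // empty string
  const first = String.fromCodePoint(cp);
  return first !== first.toLowerCase();
}
```

Most identifiers in real code start with an ASCII character, so the range check should cover the common case while the code-point comparison handles everything else.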
@babel/code-frame does not correctly tokenize non-BMP capitalized identifiers.
In this PR we enable the prefer-string-starts-ends-with ts-eslint rule and fix most lint errors.
Then we improve Unicode handling in the code-frame tokenizer. Previously we only tested whether string[0] equals string[0].toLowerCase(); this approach does not respect non-BMP characters.
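To see why the code-unit comparison misses non-BMP characters, here is a minimal sketch. The helper names `isCapitalizedByCodeUnit` and `isCapitalizedByCodePoint` are hypothetical and are not the tokenizer's actual functions. The example uses U+10400 (DESERET CAPITAL LETTER LONG I), which is stored as a surrogate pair in a JavaScript string.

```typescript
// Hypothetical helpers for illustration; not the code-frame implementation.

// Old approach: looks at the first UTF-16 code unit only.
function isCapitalizedByCodeUnit(name: string): boolean {
  // charAt(0) returns the same code unit as name[0] for non-empty strings.
  const first = name.charAt(0);
  return first !== first.toLowerCase();
}

// Code-point approach: reads the whole first code point.
function isCapitalizedByCodePoint(name: string): boolean {
  const cp = name.codePointAt(0);
  if (cp === undefined) return false; // empty string
  const first = String.fromCodePoint(cp);
  return first !== first.toLowerCase();
}

// U+10400 is encoded as the surrogate pair \uD801\uDC00.
const deseretName = "\u{10400}x";

// The first code unit is a lone high surrogate, which has no lowercase
// mapping, so the code-unit check reports "not capitalized".
console.log(isCapitalizedByCodeUnit(deseretName)); // false

// The full code point lowercases to U+10428, so this check succeeds.
console.log(isCapitalizedByCodePoint(deseretName)); // true
```

For BMP identifiers such as Foo, both checks agree; they diverge only when the first character needs two code units.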