lemire commented Nov 21, 2025

We are preparing a breaking change following @erikcorry's recent PR, which added the new utf8_length_from_utf16_with_replacement functions.

Compared to Erik's version, I made the documentation more explicit since this function will behave differently from the rest of the library.

erikcorry and others added 7 commits November 20, 2025 21:09
…860)

The next step after utf8_length_from_utf16_with_replacement
is almost always to allocate a UTF-8 buffer and then convert
the string. Sadly, we have to insert a third pass in between,
to_well_formed_utf16, which replaces unpaired surrogates
with U+FFFD.

Since surrogates are relatively rare, and the _with_replacement
functions have already scanned the input, the caller could skip
that pass if the functions reported whether they saw any
surrogates along with the UTF-8 length.

In my measurements on Ice Lake this doesn't slow down
utf8_length_from_utf16_with_replacement at all.
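
For readers who want to see the idea in code, here is a minimal scalar sketch of what the commit message describes. It is not the simdutf implementation or its API: the names length_result, utf8_length_with_replacement and to_utf8_with_replacement are hypothetical, and this sketch flags only unpaired surrogates, whereas the library may report something coarser (for example, the presence of any surrogate). The length pass returns a flag next to the UTF-8 length, and the caller runs the fix-up pass only when the flag is set.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch (not the simdutf API).
struct length_result {
  std::size_t utf8_length;      // bytes needed once lone surrogates become U+FFFD
  bool saw_unpaired_surrogate;  // true if a fix-up pass is required
};

inline bool is_high(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool is_low(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Single scan: computes the UTF-8 length and records whether any
// unpaired surrogate was seen.
length_result utf8_length_with_replacement(const char16_t* in, std::size_t n) {
  assert(in != nullptr || n == 0);
  length_result r{0, false};
  for (std::size_t i = 0; i < n; i++) {
    const char16_t c = in[i];
    if (c < 0x80) {
      r.utf8_length += 1;
    } else if (c < 0x800) {
      r.utf8_length += 2;
    } else if (is_high(c) && i + 1 < n && is_low(in[i + 1])) {
      r.utf8_length += 4;  // valid surrogate pair
      i++;                 // skip the low half
    } else {
      if (is_high(c) || is_low(c)) r.saw_unpaired_surrogate = true;
      r.utf8_length += 3;  // BMP character, or U+FFFD (also 3 bytes)
    }
  }
  return r;
}

// Caller side: the fix-up pass runs only when the length pass flagged it.
std::string to_utf8_with_replacement(std::u16string in) {
  const length_result r = utf8_length_with_replacement(in.data(), in.size());
  if (r.saw_unpaired_surrogate) {
    for (std::size_t i = 0; i < in.size(); i++) {
      if (is_high(in[i]) && i + 1 < in.size() && is_low(in[i + 1])) {
        i++;  // keep valid pairs untouched
      } else if (is_high(in[i]) || is_low(in[i])) {
        in[i] = 0xFFFD;  // replace the lone surrogate
      }
    }
  }
  std::string out;
  out.reserve(r.utf8_length);
  for (std::size_t i = 0; i < in.size(); i++) {
    char32_t cp = in[i];
    if (is_high(in[i])) {  // input is well formed here, so a low half follows
      cp = 0x10000 + ((cp - 0xD800) << 10) + (char32_t(in[i + 1]) - 0xDC00);
      i++;
    }
    if (cp < 0x80) {
      out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
      out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
      out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
      out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
  }
  assert(out.size() == r.utf8_length);
  return out;
}
```

In the common case where the flag is false, the input is touched twice (length, then conversion) rather than three times.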
lemire merged commit 94fb52e into master Nov 22, 2025
70 checks passed