@erikcorry
Collaborator

The next step after utf8_length_from_utf16_with_replacement is almost always going to be to allocate a UTF-8 buffer and then convert the string. Sadly, we have to insert a third pass, to_well_formed_utf16, which converts the unpaired surrogates.

Since surrogates are relatively rare, and the _with_replacement functions have already scanned the input, we could skip that conversion pass if the function reported whether it saw any unpaired surrogates, along with the UTF-8 length.

In my measurements on Ice Lake, this doesn't slow down utf8_length_from_utf16_with_replacement at all.
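
To illustrate the idea, here is a minimal scalar sketch of a length pass that also reports whether an unpaired surrogate was seen, so the caller can skip the well-forming pass in the common case. All names here (`Utf8LengthInfo`, `utf8_length_with_replacement`, `make_well_formed`, `append_utf8`, `convert`) are hypothetical and written for this example; they are not simdutf's API or the signature introduced by this PR.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch (not simdutf's return type).
struct Utf8LengthInfo {
  std::size_t utf8_length;      // bytes needed, lone surrogates counted as U+FFFD
  bool has_unpaired_surrogate;  // true if any lone surrogate was seen
};

inline bool is_high(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool is_low(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Pass 1: compute the UTF-8 length and remember whether repair is needed.
Utf8LengthInfo utf8_length_with_replacement(const std::u16string &in) {
  Utf8LengthInfo info{0, false};
  for (std::size_t i = 0; i < in.size(); ++i) {
    char16_t c = in[i];
    if (c < 0x80) {
      info.utf8_length += 1;
    } else if (c < 0x800) {
      info.utf8_length += 2;
    } else if (is_high(c) && i + 1 < in.size() && is_low(in[i + 1])) {
      info.utf8_length += 4;  // valid surrogate pair
      ++i;
    } else {
      if (is_high(c) || is_low(c)) info.has_unpaired_surrogate = true;
      info.utf8_length += 3;  // BMP code unit, or U+FFFD replacement
    }
  }
  return info;
}

// Optional pass: replace each unpaired surrogate with U+FFFD.
std::u16string make_well_formed(const std::u16string &in) {
  std::u16string out = in;
  for (std::size_t i = 0; i < out.size(); ++i) {
    if (is_high(out[i]) && i + 1 < out.size() && is_low(out[i + 1])) {
      ++i;  // keep the valid pair
    } else if (is_high(out[i]) || is_low(out[i])) {
      out[i] = 0xFFFD;
    }
  }
  return out;
}

// Final pass: encode well-formed UTF-16 as UTF-8.
void append_utf8(const std::u16string &in, std::string &out) {
  for (std::size_t i = 0; i < in.size(); ++i) {
    char32_t cp = in[i];
    if (is_high(in[i])) {  // input is well-formed, so a low surrogate follows
      cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
      ++i;
    }
    if (cp < 0x80) {
      out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
      out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
      out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
      out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
  }
}

// Caller: the repair pass runs only when the length pass asked for it.
std::string convert(const std::u16string &input) {
  Utf8LengthInfo info = utf8_length_with_replacement(input);
  std::string out;
  out.reserve(info.utf8_length);
  if (info.has_unpaired_surrogate) {
    append_utf8(make_well_formed(input), out);
  } else {
    append_utf8(input, out);  // common case: two passes instead of three
  }
  return out;
}
```

In this sketch the flag costs one extra branch on a path that is already handling a surrogate, which is consistent with the observation that the length function need not get slower; the real library uses SIMD kernels, so the actual cost profile may differ.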

@lemire
Member

lemire commented Nov 20, 2025

I think the only user is @anonrig; if so, we can allow a breaking change.

Member

@anonrig anonrig left a comment

perfect

@lemire
Member

lemire commented Nov 21, 2025

This PR is good and approved by @anonrig, but I will merge it into a dev branch and run a few additional checks before our release.

@lemire lemire changed the base branch from master to devutf8length November 21, 2025 02:09
@lemire lemire merged commit 7bc54a6 into simdutf:devutf8length Nov 21, 2025
47 of 51 checks passed
lemire added a commit that referenced this pull request Nov 22, 2025
* Return more information from utf8_length_from_utf16_with_replacement (#860)

* lint

* better documentation.

* version bump.

* [no-ci] minor simplification

* correct macro name. (!!!)

* removing silly space

---------

Co-authored-by: Erik Corry <erik@arbat.com>
Co-authored-by: Daniel Lemire <dlemire@lemire.me>