Return more information from utf8_length_from_utf16_with_replacement #860

erikcorry · 2025-11-20T10:55:44Z

The next step after utf8_length_from_utf16_with_replacement is almost always going to be to allocate a UTF-8 buffer and then convert the string. Sadly, we have to insert a third pass, to_well_formed_utf16, which converts the unpaired surrogates.

Since surrogates are relatively rare, and the _with_replacement functions have already scanned the input, we could skip the conversion if we were told whether any unpaired surrogates were found, along with the UTF-8 length.

In my measurements on Ice Lake, this change doesn't slow down utf8_length_from_utf16_with_replacement at all.
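
To illustrate the idea, here is a minimal scalar sketch of a length pass that also reports whether it saw an unpaired surrogate. The names `LengthInfo` and `utf8_length_with_info` are hypothetical and exist only for this example; this is not simdutf's implementation, and the real return type and signature may differ.

```cpp
#include <cstddef>

// Hypothetical result type for illustration; not simdutf's actual return type.
struct LengthInfo {
  std::size_t utf8_length;      // bytes needed once unpaired surrogates become U+FFFD
  bool has_unpaired_surrogate;  // true if a fix-up pass would change the input
};

// Scalar sketch: a single scan yields both the UTF-8 length and the flag.
inline LengthInfo utf8_length_with_info(const char16_t* in, std::size_t len) {
  LengthInfo r{0, false};
  for (std::size_t i = 0; i < len; ++i) {
    const char16_t c = in[i];
    if (c < 0x80) {
      r.utf8_length += 1;  // ASCII
    } else if (c < 0x800) {
      r.utf8_length += 2;  // two-byte sequence
    } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < len &&
               in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
      r.utf8_length += 4;  // valid surrogate pair, one supplementary code point
      ++i;                 // consume the low surrogate as well
    } else {
      if (c >= 0xD800 && c <= 0xDFFF) {
        r.has_unpaired_surrogate = true;  // will be replaced by U+FFFD
      }
      r.utf8_length += 3;  // BMP character or U+FFFD, both take three bytes
    }
  }
  return r;
}
```

A caller could then allocate `utf8_length` bytes and run the to_well_formed_utf16 pass only when `has_unpaired_surrogate` is true, converting the input directly otherwise.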

lemire · 2025-11-20T19:24:04Z

I think that the only user is @anonrig; if so, we can allow a breaking change.

anonrig

perfect

lemire · 2025-11-21T02:06:39Z

This PR is good and approved by @anonrig, but I will merge it into a dev branch and do a few additional checks before our release.

* Return more information from utf8_length_from_utf16_with_replacement (#860)

  The next step after utf8_length_from_utf16_with_replacement is almost always going to be to allocate a UTF-8 buffer and then convert the string. Sadly, we have to insert a third pass, to_well_formed_utf16, which converts the unpaired surrogates.

  Since surrogates are relatively rare, and the _with_replacement functions have already scanned the input, we could skip the conversion if we were given this information along with the utf-8 length.

  In my measurements on Icelake this doesn't slow down utf8_length_from_utf16_with_replacement at all.

* lint
* better documentation.
* version bump.
* [no-ci] minor simplification
* correct macro name. (!!!)
* removing silly space

---------

Co-authored-by: Erik Corry <erik@arbat.com>
Co-authored-by: Daniel Lemire <dlemire@lemire.me>

erikcorry mentioned this pull request Nov 20, 2025

utf8_length_from_utf16_with_replacement should return information on the presence of surrogates #861

Closed

anonrig approved these changes Nov 20, 2025

lemire changed the base branch from master to devutf8length November 21, 2025 02:09

lemire merged commit 7bc54a6 into simdutf:devutf8length Nov 21, 2025
47 of 51 checks passed

BrewTestBot mentioned this pull request Nov 22, 2025

simdutf 7.7.0 Homebrew/homebrew-core#255486

Merged

