UTF16 to UTF8 length with replacement #851

lemire · 2025-11-14T23:06:39Z

This adds functions of the type utf8_length_from_utf16_with_replacement: they compute how many bytes of UTF-8 data you would need to convert the (potentially invalid) UTF-16 input, where unpaired surrogates are turned into the replacement character. Observe that we do not yet support the conversion operation itself.

The functions utf8_length_from_utf16_with_replacement are fast, but slower than utf8_length_from_utf16 because we effectively need to validate the content, so we have a double burden: validation and counting. They should still be quite fast.
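
For readers who want the semantics spelled out, here is a minimal scalar sketch of what such a length function computes. It is not the code from this PR: the name `utf8_length_with_replacement_scalar` is hypothetical, and the sketch assumes native-endian input. Each unpaired surrogate is counted as U+FFFD, which takes 3 bytes in UTF-8, while a valid surrogate pair is counted as one 4-byte sequence.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference (not simdutf's implementation).
// Assumes native-endian UTF-16 input.
// Returns the number of UTF-8 bytes needed when every unpaired surrogate
// is replaced by U+FFFD (3 bytes in UTF-8).
inline std::size_t utf8_length_with_replacement_scalar(const char16_t *in,
                                                       std::size_t size) {
  std::size_t bytes = 0;
  for (std::size_t i = 0; i < size; i++) {
    const std::uint16_t w = static_cast<std::uint16_t>(in[i]);
    if (w < 0x80) {
      bytes += 1; // ASCII
    } else if (w < 0x800) {
      bytes += 2; // two-byte sequence
    } else if (w < 0xD800 || w > 0xDFFF) {
      bytes += 3; // non-surrogate BMP code point
    } else if (w <= 0xDBFF && i + 1 < size && in[i + 1] >= 0xDC00 &&
               in[i + 1] <= 0xDFFF) {
      bytes += 4; // high surrogate followed by low surrogate: valid pair
      i++;        // consume the low surrogate as well
    } else {
      bytes += 3; // unpaired surrogate, counted as U+FFFD
    }
  }
  return bytes;
}
```

The surrogate branches are where the validation cost mentioned above comes from: counting alone needs only the range checks, whereas replacement also needs to look at the neighbouring code unit.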

We only support ARM + x64. I will open additional issues later.

I have also added an experimental (still buggy) script to help us add new functions to simdutf so that future work is faster. I have also tuned a couple of utf8_length_from_utf16 implementations: AVX-512 and ARM now have faster functions. There are a few other minor changes.

Another change is that we make match_system a constexpr function.

fixes #849

anonrig · 2025-11-14T23:18:17Z

I'll review it in a couple of hours. Looks really good.

anonrig · 2025-11-15T00:30:19Z

My main question would be: what are the performance characteristics of the following code on incorrect and correct UTF-16 inputs?

if (simdutf::validate_utf16())
  return simdutf::utf8_length_from_utf16();

return simdutf::utf8_length_from_utf16_with_replacement();

vs

return simdutf::utf8_length_from_utf16_with_replacement();

lemire · 2025-11-15T03:31:06Z

@anonrig I think that's the optimization I applied: we use the no-surrogate case as a happy path. :-)
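
To illustrate the happy-path idea in isolation, here is a scalar sketch, assuming blocks of 32 code units to mirror one AVX-512 register. It is not the SIMD code from this PR, and `utf8_length_happy_path` is a hypothetical name. A block with no surrogates is counted with a branch-free formula; only a block that contains a surrogate pays for the pairing logic.

```cpp
#include <cstddef>
#include <cstdint>

// True if w is in the surrogate range 0xD800..0xDFFF.
inline bool is_surrogate(std::uint16_t w) { return (w & 0xF800) == 0xD800; }

// Hypothetical block-wise sketch (native-endian input assumed).
inline std::size_t utf8_length_happy_path(const char16_t *in,
                                          std::size_t size) {
  constexpr std::size_t N = 32; // block size chosen for illustration
  std::size_t bytes = 0;
  std::size_t pos = 0;
  while (pos < size) {
    const std::size_t end = (size - pos < N) ? size : pos + N;
    bool any_surrogate = false;
    for (std::size_t i = pos; i < end; i++) {
      any_surrogate |= is_surrogate(static_cast<std::uint16_t>(in[i]));
    }
    if (!any_surrogate) {
      // Happy path: each code unit needs 1, 2, or 3 bytes.
      for (std::size_t i = pos; i < end; i++) {
        const std::uint16_t w = static_cast<std::uint16_t>(in[i]);
        bytes += 1 + (w >= 0x80) + (w >= 0x800);
      }
      pos = end;
      continue;
    }
    // Slow path: pair up surrogates, count unpaired ones as U+FFFD.
    std::size_t i = pos;
    while (i < end) {
      const std::uint16_t w = static_cast<std::uint16_t>(in[i]);
      if (!is_surrogate(w)) {
        bytes += 1 + (w >= 0x80) + (w >= 0x800);
        i++;
      } else if (w <= 0xDBFF && i + 1 < size && in[i + 1] >= 0xDC00 &&
                 in[i + 1] <= 0xDFFF) {
        bytes += 4; // valid pair, possibly straddling the block boundary
        i += 2;
      } else {
        bytes += 3; // unpaired surrogate
        i++;
      }
    }
    pos = i; // equals end + 1 when a pair straddled the boundary
  }
  return bytes;
}
```

With this shape, valid input without surrogates never reaches the slow path, which is why a separate validation pass beforehand is unlikely to help.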

erikcorry · 2025-11-17T08:25:07Z

src/icelake/icelake_utf8_length_from_utf16.inl.cpp

+  // for remaining_next_mask, we start from 0x7FFFFFFFULL instead of
+  // 0xFFFFFFFFULL We could also just shift right by one remaining_mask.
+  __mmask32 remaining_next_mask = 0x7FFFFFFFULL >> (32 - (size - pos));
+  __m512i input = _mm512_maskz_loadu_epi16(remaining_mask, in + pos);


I don't understand how this load doesn't segfault on strings near page edges. Perhaps I'm overlooking something, but I think you should step back by the number of positions the mask was shifted by and shift the mask the other way. Since we are in the large-input function we can't segfault at the beginning by doing this.

@erikcorry Masked loads in AVX-512 do not segfault.

Daniel Lemire, "Modern vector programming with masked loads and stores," in Daniel Lemire's blog, November 8, 2022, https://lemire.me/blog/2022/11/08/modern-vector-programming-with-masked-loads-and-stores/.

See also:

https://stackoverflow.com/questions/54497141/when-using-a-mask-register-with-avx-512-load-and-stores-is-a-fault-raised-for-i
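
To make the mask arithmetic in the diff above concrete, here is a small scalar sketch. The helper names `tail_mask`, `tail_next_mask`, and `masked_load_emulated` are hypothetical, and the last one only emulates the zero-masking behaviour of `_mm512_maskz_loadu_epi16` in plain C++: lanes whose mask bit is clear are never read and come back as zero, which is the property that keeps the tail loads in bounds.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Low `remaining` bits set, for remaining in [0, 32].
inline std::uint32_t tail_mask(std::size_t remaining) {
  return static_cast<std::uint32_t>(0xFFFFFFFFULL >> (32 - remaining));
}

// Mask for the load at `in + pos + 1`: one lane fewer is valid, so only the
// low (remaining - 1) bits are set. Same as tail_mask(remaining) >> 1.
inline std::uint32_t tail_next_mask(std::size_t remaining) {
  return static_cast<std::uint32_t>(0x7FFFFFFFULL >> (32 - remaining));
}

// Scalar emulation of a zero-masked 32-lane load of 16-bit elements.
// Lanes with a cleared mask bit are not dereferenced and yield 0.
inline std::array<std::uint16_t, 32>
masked_load_emulated(std::uint32_t mask, const char16_t *p) {
  std::array<std::uint16_t, 32> out{};
  for (std::size_t lane = 0; lane < 32; lane++) {
    if ((mask >> lane) & 1u) {
      out[lane] = static_cast<std::uint16_t>(p[lane]);
    }
  }
  return out;
}
```

For example, with 3 code units remaining, `tail_mask(3)` is `0x7` and `tail_next_mask(3)` is `0x3`, so the shifted load touches only the two code units that actually exist after the first one.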

erikcorry · 2025-11-17T08:26:22Z

src/icelake/icelake_utf8_length_from_utf16.inl.cpp

+  }
+  __m512i input_next =
+      _mm512_maskz_loadu_epi16(remaining_next_mask, in + pos + 1);
+  if (!match_system(big_endian)) {


Same problem here, but more?

Masked loads do not segfault if the mask only loads valid bytes.

That's amazing! (Claude lied to me on this!)

erikcorry · 2025-11-17T09:35:20Z

An implementation of my idea, and I think it also fixes the out-of-bounds read:
#857

erikcorry · 2025-11-17T14:09:57Z

src/icelake/icelake_utf8_length_from_utf16.inl.cpp

  using vector_u16 = simd16<uint16_t>;
-  constexpr size_t N = vector_u16::ELEMENTS;
+  constexpr size_t N = vector_u16::ELEMENTS; // 32 on AVX-512
+  if (N + 1 > size) {


An alternative way to handle this would be to check whether the first and last characters are in the same 32-character (64-byte) cache line:
size_t aligned_start = (size_t)in & 0xffff'ffff'ffff'ffc0;
if (aligned_start == (((size_t)(in + size - 1)) & 0xffff'ffff'ffff'ffc0)) {
  // Calculate mask for actual string contents.
  __m512i input1 = _mm512_maskz_loadu_epi16(contents_mask, (const void *)aligned_start);
  // No more loads here.
  return length;
}

This would probably be faster than the current fallback for cases with a large number of small strings. (Can probably just reuse the last-few-characters code at the end of the current function, just with a different load address and mask.)

Once you have detected this case you can still go to the regular implementation below knowing that the final load that gets the last char can't fault, because the string, however small, spans two 64 byte cache lines, both of which can't fault.

Disadvantage is you have to mark it so that sanitizers don't get annoyed with the OOB read, but I think you already have this issue.

Rumour has it that Ice Lake is slow if you just use AVX-512 once because it has to wake up that section of the chip, but that newer Intel CPUs and AMD's Zen 4 don't have that issue. So this might not pay off everywhere.
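
A portable sketch of the two pieces this suggestion needs, without the AVX-512 load itself: a predicate for "the whole string sits in one 64-byte line" and the lane mask for an aligned load from the start of that line. Both function names are hypothetical, and the mask helper assumes the pointer is 2-byte aligned and that the predicate has already returned true.

```cpp
#include <cstddef>
#include <cstdint>

// True if every byte of in[0..size) lies in the same 64-byte line.
inline bool fits_in_one_cache_line(const char16_t *in, std::size_t size) {
  if (size == 0) {
    return true;
  }
  constexpr std::uintptr_t line_mask = ~static_cast<std::uintptr_t>(63);
  const std::uintptr_t first = reinterpret_cast<std::uintptr_t>(in);
  // Address of the last byte of the last code unit.
  const std::uintptr_t last = reinterpret_cast<std::uintptr_t>(in + size) - 1;
  return (first & line_mask) == (last & line_mask);
}

// Lane mask for a 32-lane load from the aligned start of the line.
// Precondition (assumed): fits_in_one_cache_line(in, size) is true and
// `in` is 2-byte aligned, so offset + size <= 32.
inline std::uint32_t contents_mask_for_line(const char16_t *in,
                                            std::size_t size) {
  const std::size_t offset =
      (reinterpret_cast<std::uintptr_t>(in) & 63) / sizeof(char16_t);
  const std::uint32_t low =
      size >= 32 ? 0xFFFFFFFFu : ((std::uint32_t(1) << size) - 1);
  return low << offset;
}
```

Whether this beats the scalar fallback for many small strings would still need benchmarking.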

@erikcorry

You are correct that for small strings, we could process the full input in AVX-512, and given that we have established that masked loads are safe, it is not difficult... But I think you will agree that this would require performance testing.

I think we can make such changes in future PRs.

lemire · 2025-11-17T17:35:45Z

@erikcorry

> An implementation of my idea, and I think it also fixes the out-of-bounds read:

Thanks.

I still want to merge this PR this week, but given Erik's work, I will delay the merge a bit.

lemire · 2025-11-18T15:30:45Z

The implementations can be further improved, but I think that this is good enough as an initial implementation. Merging. Release soon after.

lemire and others added 11 commits November 10, 2025 22:31

init

870eab9

adding tests.

e6af22d

initial impl.

ad74302

adding comment.

c1e1e85

format

cd0f80b

haswell and westmere

8dd011f

implemented icelake

b462f2c

speeding up icelake

7afed5e

done with icelake

f6202cd

better documentation.

68a9be5

fixing portability issue with Windows

df9993b

lemire requested review from anonrig and Copilot November 14, 2025 23:12

anonrig changed the title ~~UTF6 to UTF8 length with replacement~~ UTF16 to UTF8 length with replacement Nov 14, 2025

This was referenced Nov 14, 2025

Add Loongson versions of add UTF16 to UTF8 length with replacement #853 #854

Open

Add RVV versions of add UTF16 to UTF8 length with replacement #853

Open

add UTF16 to UTF8 length with replacement to fuzzer #852

Open

lemire mentioned this pull request Nov 14, 2025

improve text encoder encode performance cloudflare/workerd#5448

Open

lemire requested a review from Copilot November 14, 2025 23:33

Copilot started reviewing on behalf of lemire November 14, 2025 23:46

Copilot finished reviewing on behalf of lemire November 14, 2025 23:49

Copilot started reviewing on behalf of lemire November 15, 2025 00:16

Copilot finished reviewing on behalf of lemire November 15, 2025 00:18

got the name of the intrinsic wrong.

a8fc8be

anonrig reviewed Nov 15, 2025

fixing other missed opportunities

34a0ffa

fixing the cast

6a673a5

pauldreik reviewed Nov 15, 2025

lemire and others added 4 commits November 15, 2025 14:18

Update scripts/README_ADD_FUNCTION.md

eca1c60

Co-authored-by: Paul Dreik <github@pauldreik.se>

Update CONTRIBUTING.md

f3d0473

Co-authored-by: Paul Dreik <github@pauldreik.se>

Update CONTRIBUTING.md

fd01f7a

Co-authored-by: Paul Dreik <github@pauldreik.se>

Update scripts/README_ADD_FUNCTION.md

e57be99

Co-authored-by: Paul Dreik <github@pauldreik.se>

anonrig approved these changes Nov 15, 2025

lemire added 4 commits November 15, 2025 14:42

correcting feature check.

461bc89

fixing big-endian issue

6bad6c6

lint.

9afaaf0

typo

07906a7

lemire requested review from Copilot and pauldreik November 15, 2025 21:38

Copilot AI reviewed Nov 15, 2025

Copilot started reviewing on behalf of lemire November 16, 2025 02:52

Copilot finished reviewing on behalf of lemire November 16, 2025 02:53

erikcorry reviewed Nov 16, 2025

erikcorry reviewed Nov 17, 2025

erikcorry and others added 5 commits November 17, 2025 21:03

Alternative strategy for UTF-8 length from malformed UTF-16 (#857)

b42b794

* Alternative strategy for UTF-8 length from malformed UTF-16
* Don't expect any surrogates, skip work in this case

correct the memcpy

7329990

lint

c0a3e47

more testing and fixing a bug in generic and arm impl.

6415cd7

adding alignment (workaround for bug in some versions of gcc).

255b67e

lemire merged commit 006c083 into master Nov 18, 2025
70 checks passed
