Skip to content

Conversation

@lemire
Copy link
Member

@lemire lemire commented Nov 14, 2025

This adds functions of the type utf8_length_from_utf16_with_replacement: they compute how many bytes of UTF-8 data you would need to convert the (potentially invalid) UTF-16 input, where unpaired surrogates are turned into the replacement character. Observe that we do not yet support the conversion operation itself.

The functions utf8_length_from_utf16_with_replacement are fast, but slower than utf8_length_from_utf16 because we need to effectively validate the content, so have a double burden: validation and counting. It should still be quite fast.

We only support ARM + x64. I will open additional issues later.

I have also added an experimental (still buggy) script to help us add new functions to simdutf so that future work is faster. I have also tuned a couple of utf8_length_from_utf16 implementations : AVX-512 and ARM have now faster functions. There are few other minor changes.

Another change is that we make match_system a constexpr function.

fixes #849

@anonrig
Copy link
Member

anonrig commented Nov 14, 2025

I'll review it in a couple of hours. Looks really good.

This comment was marked as resolved.

This comment was marked as resolved.

@anonrig
Copy link
Member

anonrig commented Nov 15, 2025

My main question would be: What's the performance characteristics of the following code in incorrect and correct utf16 inputs?

if (simdutf::validate_utf16()) 
  return simdutf:: utf8_length_from_utf16()

return simdutf:: utf8_length_from_utf16_with_replacement()

vs

return simdutf:: utf8_length_from_utf16_with_replacement()

@lemire
Copy link
Member Author

lemire commented Nov 15, 2025

@anonrig I think that's the optimization I applied: we use the no-surrogate case as a happy path. :-)

lemire and others added 4 commits November 15, 2025 14:18
Co-authored-by: Paul Dreik <github@pauldreik.se>
Co-authored-by: Paul Dreik <github@pauldreik.se>
Co-authored-by: Paul Dreik <github@pauldreik.se>
Co-authored-by: Paul Dreik <github@pauldreik.se>
@lemire lemire requested review from Copilot and pauldreik November 15, 2025 21:38
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

// for remaining_next_mask, we start from 0x7FFFFFFFULL instead of
// 0xFFFFFFFFULL We could also just shift right by one remaining_mask.
__mmask32 remaining_next_mask = 0x7FFFFFFFULL >> (32 - (size - pos));
__m512i input = _mm512_maskz_loadu_epi16(remaining_mask, in + pos);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how this load doesn't segfault on strings near page edges. Perhaps I'm overlooking something, but I think you should step back by the number of positions the mask was shifted by and shift the mask the other way. Since we are in the large-input function we can't segfault at the beginning by doing this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erikcorry Masked loads in AVX-512 do not segfaults.

Daniel Lemire, "Modern vector programming with masked loads and stores," in Daniel Lemire's blog, November 8, 2022, https://lemire.me/blog/2022/11/08/modern-vector-programming-with-masked-loads-and-stores/.

See also:

https://stackoverflow.com/questions/54497141/when-using-a-mask-register-with-avx-512-load-and-stores-is-a-fault-raised-for-i

}
__m512i input_next =
_mm512_maskz_loadu_epi16(remaining_next_mask, in + pos + 1);
if (!match_system(big_endian)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same problem here, but more?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Masked loads do not segfaults if the mask only loads valid bytes.

Copy link
Collaborator

@erikcorry erikcorry Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's amazing! (Claude lied to me on this!)

@erikcorry
Copy link
Collaborator

An implementation of my idea, and I think it also fixes the out-of-bounds read:
#857

using vector_u16 = simd16<uint16_t>;
constexpr size_t N = vector_u16::ELEMENTS;
constexpr size_t N = vector_u16::ELEMENTS; // 32 on AVX-512
if (N + 1 > size) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative way to handle this would be to check whether the first and last characters were in the same 32-character (64 byte) cache line:
size_t aligned_start = (size_t)in & 0xffff'ffff'ffff'ffc0;
if (aligned_start == (((size_t)(in + size - 1)) & 0xffff'ffff'ffff'ffc0) {
// Calculate mask for actual string contents.
_m512i input1 = _mm512_maskz_loadu_epi16(contents_mask, aligned_start);
// No more loads here.
return length;
}

This would probably be faster than the current fallback for cases with a large number of small strings. (Can probably just reuse the last-few-characters code at the end of the current function, just with a different load address and mask.)

Once you have detected this case you can still go to the regular implementation below knowing that the final load that gets the last char can't fault, because the string, however small, spans two 64 byte cache lines, both of which can't fault.

Disadvantage is you have to mark it so that sanitizers don't get annoyed with the OOB read, but I think you already have this issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rumour has it that IceLake is slow if you just use AVX-512 once because it has to wake up that section of the chip, but that newer Intel CPUs AMD's Zen4 doesn't have that issue. So this might not pay off everywhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erikcorry

You are correct that for small strings, we could do the full input in AVX-512, and given that we have established that mask loads are safe, it is not difficult... But I think you will agree that this would requiring performance testing.

I think we can make such changes in future PRs.

@lemire
Copy link
Member Author

lemire commented Nov 17, 2025

@erikcorry

An implementation of my idea, and I think it also fixes the out-of-bounds read:

Thanks.

I still want to merge this PR this week, but given Eric's work, I will delay a bit the merge.

@lemire
Copy link
Member Author

lemire commented Nov 18, 2025

The implementations can be further improved, but I think that this is good enough as an initial implementation. Merging. Release soon after.

@lemire lemire merged commit 006c083 into master Nov 18, 2025
70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

adding a new method: utf8_length_from_invalid_utf16

5 participants