-
Notifications
You must be signed in to change notification settings - Fork 116
UTF16 to UTF8 length with replacement #851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I'll review it in a couple of hours. Looks really good. |
|
My main question would be: What's the performance characteristics of the following code in incorrect and correct utf16 inputs? if (simdutf::validate_utf16())
return simdutf:: utf8_length_from_utf16()
return simdutf:: utf8_length_from_utf16_with_replacement()vs return simdutf:: utf8_length_from_utf16_with_replacement() |
|
@anonrig I think that's the optimization I applied: we use the no-surrogate case as a happy path. :-) |
Co-authored-by: Paul Dreik <github@pauldreik.se>
Co-authored-by: Paul Dreik <github@pauldreik.se>
Co-authored-by: Paul Dreik <github@pauldreik.se>
Co-authored-by: Paul Dreik <github@pauldreik.se>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
| // for remaining_next_mask, we start from 0x7FFFFFFFULL instead of | ||
| // 0xFFFFFFFFULL We could also just shift right by one remaining_mask. | ||
| __mmask32 remaining_next_mask = 0x7FFFFFFFULL >> (32 - (size - pos)); | ||
| __m512i input = _mm512_maskz_loadu_epi16(remaining_mask, in + pos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how this load doesn't segfault on strings near page edges. Perhaps I'm overlooking something, but I think you should step back by the number of positions the mask was shifted by and shift the mask the other way. Since we are in the large-input function we can't segfault at the beginning by doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@erikcorry Masked loads in AVX-512 do not segfaults.
Daniel Lemire, "Modern vector programming with masked loads and stores," in Daniel Lemire's blog, November 8, 2022, https://lemire.me/blog/2022/11/08/modern-vector-programming-with-masked-loads-and-stores/.
See also:
| } | ||
| __m512i input_next = | ||
| _mm512_maskz_loadu_epi16(remaining_next_mask, in + pos + 1); | ||
| if (!match_system(big_endian)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same problem here, but more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Masked loads do not segfaults if the mask only loads valid bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's amazing! (Claude lied to me on this!)
|
An implementation of my idea, and I think it also fixes the out-of-bounds read: |
| using vector_u16 = simd16<uint16_t>; | ||
| constexpr size_t N = vector_u16::ELEMENTS; | ||
| constexpr size_t N = vector_u16::ELEMENTS; // 32 on AVX-512 | ||
| if (N + 1 > size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative way to handle this would be to check whether the first and last characters were in the same 32-character (64 byte) cache line:
size_t aligned_start = (size_t)in & 0xffff'ffff'ffff'ffc0;
if (aligned_start == (((size_t)(in + size - 1)) & 0xffff'ffff'ffff'ffc0) {
// Calculate mask for actual string contents.
_m512i input1 = _mm512_maskz_loadu_epi16(contents_mask, aligned_start);
// No more loads here.
return length;
}
This would probably be faster than the current fallback for cases with a large number of small strings. (Can probably just reuse the last-few-characters code at the end of the current function, just with a different load address and mask.)
Once you have detected this case you can still go to the regular implementation below knowing that the final load that gets the last char can't fault, because the string, however small, spans two 64 byte cache lines, both of which can't fault.
Disadvantage is you have to mark it so that sanitizers don't get annoyed with the OOB read, but I think you already have this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rumour has it that IceLake is slow if you just use AVX-512 once because it has to wake up that section of the chip, but that newer Intel CPUs AMD's Zen4 doesn't have that issue. So this might not pay off everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct that for small strings, we could do the full input in AVX-512, and given that we have established that mask loads are safe, it is not difficult... But I think you will agree that this would requiring performance testing.
I think we can make such changes in future PRs.
Thanks. I still want to merge this PR this week, but given Eric's work, I will delay a bit the merge. |
* Alternative strategy for UTF-8 length from malformed UTF-16 * Don't expect any surrogates, skip work in this case
|
The implementations can be further improved, but I think that this is good enough as an initial implementation. Merging. Release soon after. |
This adds functions of the type
utf8_length_from_utf16_with_replacement: they compute how many bytes of UTF-8 data you would need to convert the (potentially invalid) UTF-16 input, where unpaired surrogates are turned into the replacement character. Observe that we do not yet support the conversion operation itself.The functions
utf8_length_from_utf16_with_replacementare fast, but slower thanutf8_length_from_utf16because we need to effectively validate the content, so have a double burden: validation and counting. It should still be quite fast.We only support ARM + x64. I will open additional issues later.
I have also added an experimental (still buggy) script to help us add new functions to simdutf so that future work is faster. I have also tuned a couple of
utf8_length_from_utf16implementations : AVX-512 and ARM have now faster functions. There are few other minor changes.Another change is that we make match_system a constexpr function.
fixes #849