@lemire (Member) commented Feb 11, 2025

fixes: #599

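Effectively, we seek to provide the equivalent of JavaScript's String.prototype.toWellFormed for UTF-16 buffers: each unpaired surrogate is replaced with U+FFFD and all other code units are copied unchanged.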
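
For readers unfamiliar with toWellFormed, here is a minimal scalar sketch of the behavior in C++. It is not the code from this PR: the function name to_well_formed_sketch is hypothetical, and the SIMD kernels benchmarked below are organized differently. It only illustrates the rule that unpaired surrogates become U+FFFD while valid pairs and ordinary code units pass through untouched.

```cpp
#include <cstddef>
#include <string>

// Hypothetical scalar reference, NOT the simdutf implementation.
// Copies UTF-16 code units, replacing each unpaired surrogate with U+FFFD.
std::u16string to_well_formed_sketch(const std::u16string &in) {
  constexpr char16_t replacement = 0xFFFD;
  std::u16string out(in.size(), u'\0');
  for (std::size_t i = 0; i < in.size(); i++) {
    const char16_t c = in[i];
    const bool is_high = (c & 0xFC00) == 0xD800; // U+D800..U+DBFF
    const bool is_low = (c & 0xFC00) == 0xDC00;  // U+DC00..U+DFFF
    if (is_high && i + 1 < in.size() && (in[i + 1] & 0xFC00) == 0xDC00) {
      // Valid surrogate pair: copy both code units and skip the low half.
      out[i] = c;
      out[i + 1] = in[i + 1];
      i++;
    } else if (is_high || is_low) {
      // Lone surrogate (high without low, or low without preceding high).
      out[i] = replacement;
    } else {
      // Ordinary code unit from the Basic Multilingual Plane.
      out[i] = c;
    }
  }
  return out;
}
```

The output always has the same length as the input, which is why the benchmark below amounts to a copy with checks when the input is already valid.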

I am using wikipedia_mars/arabic.utf16.txt (see https://github.com/lemire/unicode_lipsum#) as a reference. The file is valid UTF-16, so we are essentially benchmarking a copy with validation checks.

Tests on an Apple M2 with LLVM 15:

$  sudo ./build/benchmarks/benchmark -P run_to_well_formed_utf16 -F ../unicode_lipsum/wikipedia_mars/arabic.utf16.txt
Password:
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
Using ICU version 76.1
Using iconv version 267
Compiler: Clang 15.0.0
SIMDUTF version: 6.4.2
System: arm64
===========================
testcases: 1
input detected as UTF16 little-endian
===========================
run_to_well_formed_utf16+arm64, input size: 849690, iterations: 30000, dataset: ../unicode_lipsum/wikipedia_mars/arabic.utf16.txt
   0.446 ins/byte,    0.131 cycle/byte,   27.783 GB/s (3.2 %),     3.637 GHz,    3.410 ins/cycle 
   0.893 ins/char,    0.262 cycle/char,   13.892 Gc/s (3.2 %)     2.00 byte/char  30583.0 ns
run_to_well_formed_utf16+fallback, input size: 849690, iterations: 30000, dataset: ../unicode_lipsum/wikipedia_mars/arabic.utf16.txt
   8.017 ins/byte,    1.172 cycle/byte,    3.001 GB/s (7.1 %),     3.518 GHz,    6.838 ins/cycle 
  16.033 ins/char,    2.345 cycle/char,    1.501 Gc/s (7.1 %)     2.00 byte/char 283125.0 ns

Tests on an Intel Ice Lake with GCC 12:

$ ./buildrelease/benchmarks/benchmark -P run_to_well_formed_utf16 -F unicode_lipsum/wikipedia_mars/arabic.utf16.txt 
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
Using ICU version 67.1
Compiler: GCC 12.2.1
SIMDUTF version: 6.4.2
System: icelake
===========================
testcases: 1
input detected as UTF16 little-endian
===========================
run_to_well_formed_utf16+fallback, input size: 849690, iterations: 3000, dataset: unicode_lipsum/wikipedia_mars/arabic.utf16.txt
   8.000 ins/byte,    1.502 cycle/byte,    2.126 GB/s (16.9 %),     3.193 GHz,    5.326 ins/cycle 
  16.001 ins/char,    3.004 cycle/char,    1.063 Gc/s (16.9 %)     2.00 byte/char 399723.0 ns
WARNING: Measurements are noisy, try increasing iteration count (-I).
run_to_well_formed_utf16+haswell, input size: 849690, iterations: 3000, dataset: unicode_lipsum/wikipedia_mars/arabic.utf16.txt
   0.375 ins/byte,    0.173 cycle/byte,   18.444 GB/s (1.2 %),     3.199 GHz,    2.164 ins/cycle 
   0.751 ins/char,    0.347 cycle/char,    9.222 Gc/s (1.2 %)     2.00 byte/char  46069.0 ns
run_to_well_formed_utf16+icelake, input size: 849690, iterations: 3000, dataset: unicode_lipsum/wikipedia_mars/arabic.utf16.txt
   0.203 ins/byte,    0.185 cycle/byte,   16.750 GB/s (1.2 %),     3.099 GHz,    1.100 ins/cycle 
   0.407 ins/char,    0.370 cycle/char,    8.375 Gc/s (1.2 %)     2.00 byte/char  50727.0 ns
run_to_well_formed_utf16+westmere, input size: 849690, iterations: 3000, dataset: unicode_lipsum/wikipedia_mars/arabic.utf16.txt
   1.000 ins/byte,    0.226 cycle/byte,   14.177 GB/s (1.1 %),     3.198 GHz,    4.435 ins/cycle 
   2.001 ins/char,    0.451 cycle/char,    7.088 Gc/s (1.1 %)     2.00 byte/char  59935.0 ns

Interestingly, the icelake function comes in slightly slower than the haswell one (16.750 GB/s versus 18.444 GB/s), even though it executes fewer instructions per byte (0.203 versus 0.375). The result is robust: switching to clang does not change this observation. The difference is small, so I am tempted to ignore the issue for now.

Joint work with @clausecker

@lemire lemire force-pushed the utf16_well_formed branch 2 times, most recently from 44ea7be to a6a4fd2 Compare April 9, 2025 19:46
@lemire lemire force-pushed the utf16_well_formed branch from e383e03 to d23f2a6 Compare April 9, 2025 23:32
@lemire lemire requested a review from WojciechMula April 11, 2025 03:38

@WojciechMula (Collaborator) left a comment

Impressive work! Hats off. I had some comments.

@lemire lemire merged commit 4fb8de7 into master Apr 15, 2025
77 checks passed

Development

Successfully merging this pull request may close these issues.

UTF-16 to UTF-16 with replacement