Contributor

@lpinca lpinca commented Aug 12, 2021

If the 8 + 8 byte ASCII check fails, there is no reason to repeat it
until the non-ASCII byte has been found.
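
For readers following along, the idea can be sketched roughly as below. This is a minimal illustration, not the code changed in this PR: the function name `validate_utf8_sketch` and the multi-byte decoding details are assumptions made for the example. The part that matters is the inner `while (byte < 0x80)` loop, which walks ASCII bytes one at a time after a failed 16-byte check instead of retrying that check at every position.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical scalar UTF-8 validator with an 8 + 8 byte ASCII fast path.
bool validate_utf8_sketch(const char *buf, std::size_t len) {
  const std::uint8_t *data = reinterpret_cast<const std::uint8_t *>(buf);
  std::size_t pos = 0;
  while (pos < len) {
    // Fast path: if 16 bytes are available, check them all for a set high bit.
    if (pos + 16 <= len) {
      std::uint64_t v1, v2;
      std::memcpy(&v1, data + pos, sizeof(v1));
      std::memcpy(&v2, data + pos + sizeof(v1), sizeof(v2));
      if (((v1 | v2) & 0x8080808080808080ULL) == 0) {
        pos += 16;
        continue;
      }
    }
    // Either the block contains a non-ASCII byte or fewer than 16 bytes
    // remain. Skip ASCII bytes one by one until the non-ASCII byte is
    // reached, rather than repeating the block check after each byte.
    std::uint8_t byte = data[pos];
    while (byte < 0x80) {
      if (++pos == len) { return true; }
      byte = data[pos];
    }
    // Decode one multi-byte sequence starting at the non-ASCII byte.
    std::size_t n;           // sequence length in bytes
    std::uint32_t cp;        // code point being assembled
    std::uint32_t min_cp;    // smallest value allowed for this length
    if ((byte & 0xE0) == 0xC0) { n = 2; cp = byte & 0x1F; min_cp = 0x80; }
    else if ((byte & 0xF0) == 0xE0) { n = 3; cp = byte & 0x0F; min_cp = 0x800; }
    else if ((byte & 0xF8) == 0xF0) { n = 4; cp = byte & 0x07; min_cp = 0x10000; }
    else { return false; }   // stray continuation byte or invalid lead byte
    if (pos + n > len) { return false; }  // truncated sequence
    for (std::size_t i = 1; i < n; ++i) {
      const std::uint8_t c = data[pos + i];
      if ((c & 0xC0) != 0x80) { return false; }  // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong encodings, surrogates, and values above U+10FFFF.
    if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      return false;
    }
    pos += n;
  }
  return true;
}
```

Without the inner loop, an input such as `aaaaaaaaaaaaaa©` would trigger the 16-byte check again after every single `a`, even though the first failed check already showed that a non-ASCII byte lies ahead.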

Member

lemire commented Aug 12, 2021

What are your benchmark numbers supporting this optimization?

By how much does it speed up processing? Please include diverse data sources.

Contributor Author

lpinca commented Aug 12, 2021

Master branch

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/data/data.txt 
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 1
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 10485760, iterations: 2000, dataset: /Users/luigi/data/data.txt
   1.064 GB/s (0.6 %)    0.997 Gc/s     1.07 byte/char

This PR

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/data/data.txt 
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 1
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 10485760, iterations: 2000, dataset: /Users/luigi/data/data.txt
   2.542 GB/s (1.5 %)    2.383 Gc/s     1.07 byte/char

The dataset is a 10 MiB file made of aaaaaaaaaaaaaa© chunks.

I will run more benchmarks with real-world datasets later.
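
For anyone who wants to reproduce the synthetic input, a file of this shape could be built along the following lines. This is only a sketch: the helper name `make_synthetic_dataset` is hypothetical, and the chunk is assumed to be exactly 14 ASCII `a` bytes followed by `©` (U+00A9, two bytes in UTF-8), which gives 16 bytes for 15 code points and is consistent with the 1.07 byte/char reported above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: builds a buffer made of repeated "aaaaaaaaaaaaaa©"
// chunks, never exceeding total_bytes.
std::string make_synthetic_dataset(std::size_t total_bytes) {
  // 14 ASCII bytes + U+00A9 encoded as 0xC2 0xA9 = 16 bytes per chunk.
  const std::string chunk = "aaaaaaaaaaaaaa\xC2\xA9";
  std::string out;
  out.reserve(total_bytes);
  while (out.size() + chunk.size() <= total_bytes) {
    out += chunk;
  }
  return out;
}
```

Calling `make_synthetic_dataset(10485760)` yields a 10 MiB buffer (655360 chunks) that can be written to disk and passed to the benchmark with `-F`. Every 16-byte block contains one non-ASCII character, so the fast path fails on every block.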

Member

lemire commented Aug 12, 2021

I am concerned about such a synthetic dataset.

Would you try again with the files from this repository?

https://github.com/lemire/unicode_lipsum

Contributor Author

lpinca commented Aug 12, 2021

Wikipedia Japanese main page (https://ja.wikipedia.org/wiki/メインページ):

Master branch

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/data/wikipedia.ja.html 
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 1
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 98576, iterations: 2000, dataset: /Users/luigi/data/wikipedia.ja.html
   2.204 GB/s (1.9 %)    1.938 Gc/s     1.14 byte/char

This PR

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/data/wikipedia.ja.html 
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 1
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 98576, iterations: 2000, dataset: /Users/luigi/data/wikipedia.ja.html
   3.377 GB/s (2.6 %)    2.969 Gc/s     1.14 byte/char

Contributor Author

lpinca commented Aug 12, 2021

> Would you try again with the files from this repository?

All of them?

Contributor Author

lpinca commented Aug 12, 2021

From a quick look, almost all files in that repository, with the exception of Russian-Lipsum.utf8.txt and Latin-Lipsum.utf8.txt, are made exclusively of multibyte characters, so it doesn't make much sense, but I will do it.

Never mind, I did not see the wikipedia_mars folder. I will use the files in that folder.

Contributor Author

lpinca commented Aug 12, 2021

Master branch

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/unicode_lipsum/wikipedia_mars/*.html
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 18
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 954430, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/arabic.html
   1.588 GB/s (2.1 %)    1.388 Gc/s     1.14 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 382079, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/chinese.html
   2.031 GB/s (2.3 %)    1.787 Gc/s     1.14 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 368442, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/czech.html
   2.346 GB/s (2.7 %)    2.287 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 1005060, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/english.html
  12.123 GB/s (2.2 %)   12.088 Gc/s     1.00 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 192461, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/esperanto.html
   4.831 GB/s (3.6 %)    4.757 Gc/s     1.02 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 1032638, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/french.html
   3.584 GB/s (2.9 %)    3.539 Gc/s     1.01 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 397376, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/german.html
   4.850 GB/s (2.9 %)    4.793 Gc/s     1.01 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 326722, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/greek.html
   1.693 GB/s (2.7 %)    1.482 Gc/s     1.14 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 327412, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/hebrew.html
   1.320 GB/s (2.4 %)    1.136 Gc/s     1.16 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 712465, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/hindi.html
   1.585 GB/s (1.9 %)    1.301 Gc/s     1.22 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 304786, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/japanese.html
   1.790 GB/s (2.3 %)    1.509 Gc/s     1.19 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 193001, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/korean.html
   1.559 GB/s (2.4 %)    1.346 Gc/s     1.16 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 293677, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/persan.html
   1.608 GB/s (2.8 %)    1.425 Gc/s     1.13 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 692409, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/portuguese.html
   4.270 GB/s (2.6 %)    4.225 Gc/s     1.01 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 713817, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/russian.html
   1.421 GB/s (2.2 %)    1.217 Gc/s     1.17 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 1088085, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/thai.html
   1.972 GB/s (2.0 %)    1.618 Gc/s     1.22 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 387007, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/turkish.html
   2.319 GB/s (2.8 %)    2.259 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 674255, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/vietnamese.html
   1.614 GB/s (2.2 %)    1.523 Gc/s     1.06 byte/char

This PR

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/unicode_lipsum/wikipedia_mars/*.html
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 18
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 954430, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/arabic.html
   1.763 GB/s (2.4 %)    1.541 Gc/s     1.14 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 382079, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/chinese.html
   2.908 GB/s (2.7 %)    2.559 Gc/s     1.14 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 368442, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/czech.html
   2.869 GB/s (5.5 %)    2.797 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 1005060, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/english.html
  16.369 GB/s (3.1 %)   16.322 Gc/s     1.00 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 192461, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/esperanto.html
   6.678 GB/s (3.3 %)    6.575 Gc/s     1.02 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 1032638, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/french.html
   4.593 GB/s (3.4 %)    4.535 Gc/s     1.01 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 397376, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/german.html
   6.486 GB/s (2.6 %)    6.411 Gc/s     1.01 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 326722, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/greek.html
   1.907 GB/s (2.9 %)    1.670 Gc/s     1.14 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 327412, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/hebrew.html
   1.457 GB/s (2.8 %)    1.254 Gc/s     1.16 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 712465, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/hindi.html
   2.229 GB/s (2.4 %)    1.830 Gc/s     1.22 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 304786, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/japanese.html
   2.589 GB/s (2.7 %)    2.183 Gc/s     1.19 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 193001, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/korean.html
   2.118 GB/s (3.4 %)    1.829 Gc/s     1.16 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 293677, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/persan.html
   1.803 GB/s (3.0 %)    1.598 Gc/s     1.13 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 692409, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/portuguese.html
   5.480 GB/s (3.1 %)    5.422 Gc/s     1.01 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 713817, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/russian.html
   1.612 GB/s (3.4 %)    1.380 Gc/s     1.17 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 1088085, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/thai.html
   2.971 GB/s (2.1 %)    2.437 Gc/s     1.22 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 387007, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/turkish.html
   2.819 GB/s (5.3 %)    2.746 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 674255, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/vietnamese.html
   2.034 GB/s (4.0 %)    1.920 Gc/s     1.06 byte/char

Contributor Author

lpinca commented Aug 12, 2021

Master branch

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/unicode_lipsum/wikipedia_mars/*.utf8.txt 
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 18
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 533857, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/arabic.utf8.txt
   1.020 GB/s (2.2 %)    0.812 Gc/s     1.26 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 181321, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/chinese.utf8.txt
   1.114 GB/s (2.7 %)    0.843 Gc/s     1.32 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 152721, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/czech.utf8.txt
   1.141 GB/s (3.3 %)    1.074 Gc/s     1.06 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 390368, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/english.utf8.txt
   8.678 GB/s (3.1 %)    8.614 Gc/s     1.01 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 86963, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/esperanto.utf8.txt
   2.886 GB/s (3.5 %)    2.792 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 446908, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/french.utf8.txt
   1.926 GB/s (2.4 %)    1.874 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 205779, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/german.utf8.txt
   2.982 GB/s (2.5 %)    2.916 Gc/s     1.02 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 181348, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/greek.utf8.txt
   1.032 GB/s (2.7 %)    0.814 Gc/s     1.27 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 190114, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/hebrew.utf8.txt
   0.841 GB/s (2.2 %)    0.648 Gc/s     1.30 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 396593, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/hindi.utf8.txt
   0.971 GB/s (2.0 %)    0.671 Gc/s     1.45 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 164355, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/japanese.utf8.txt
   1.078 GB/s (2.3 %)    0.780 Gc/s     1.38 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 97859, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/korean.utf8.txt
   0.889 GB/s (2.6 %)    0.662 Gc/s     1.34 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 156209, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/persan.utf8.txt
   0.973 GB/s (2.5 %)    0.777 Gc/s     1.25 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 280660, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/portuguese.utf8.txt
   2.189 GB/s (3.2 %)    2.134 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 407095, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/russian.utf8.txt
   0.913 GB/s (2.3 %)    0.700 Gc/s     1.30 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 593589, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/thai.utf8.txt
   1.199 GB/s (1.8 %)    0.818 Gc/s     1.47 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 195078, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/turkish.utf8.txt
   1.333 GB/s (2.8 %)    1.267 Gc/s     1.05 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 319029, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/vietnamese.utf8.txt
   0.837 GB/s (2.2 %)    0.741 Gc/s     1.13 byte/char

This PR

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/unicode_lipsum/wikipedia_mars/*.utf8.txt
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 18
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 533857, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/arabic.utf8.txt
   1.111 GB/s (2.4 %)    0.884 Gc/s     1.26 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 181321, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/chinese.utf8.txt
   1.602 GB/s (2.5 %)    1.212 Gc/s     1.32 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 152721, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/czech.utf8.txt
   1.383 GB/s (5.6 %)    1.303 Gc/s     1.06 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 390368, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/english.utf8.txt
  12.315 GB/s (3.9 %)   12.225 Gc/s     1.01 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 86963, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/esperanto.utf8.txt
   4.225 GB/s (4.8 %)    4.088 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 446908, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/french.utf8.txt
   2.449 GB/s (4.1 %)    2.383 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 205779, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/german.utf8.txt
   4.003 GB/s (3.3 %)    3.914 Gc/s     1.02 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 181348, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/greek.utf8.txt
   1.158 GB/s (3.0 %)    0.913 Gc/s     1.27 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 190114, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/hebrew.utf8.txt
   0.921 GB/s (2.7 %)    0.709 Gc/s     1.30 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 396593, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/hindi.utf8.txt
   1.363 GB/s (2.1 %)    0.941 Gc/s     1.45 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 164355, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/japanese.utf8.txt
   1.582 GB/s (2.6 %)    1.145 Gc/s     1.38 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 97859, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/korean.utf8.txt
   1.226 GB/s (3.1 %)    0.914 Gc/s     1.34 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 156209, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/persan.utf8.txt
   1.074 GB/s (2.7 %)    0.857 Gc/s     1.25 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 280660, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/portuguese.utf8.txt
   2.763 GB/s (3.4 %)    2.694 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 407095, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/russian.utf8.txt
   1.023 GB/s (3.0 %)    0.784 Gc/s     1.30 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 593589, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/thai.utf8.txt
   1.831 GB/s (1.9 %)    1.249 Gc/s     1.47 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 195078, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/turkish.utf8.txt
   1.624 GB/s (6.0 %)    1.544 Gc/s     1.05 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 319029, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/vietnamese.utf8.txt
   1.058 GB/s (4.6 %)    0.936 Gc/s     1.13 byte/char

Contributor Author

lpinca commented Aug 12, 2021

Master branch

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/unicode_lipsum/lipsum/*.utf8.txt
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 9
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 81685, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Arabic-Lipsum.utf8.txt
   0.657 GB/s (3.0 %)    0.368 Gc/s     1.78 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 69840, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Chinese-Lipsum.utf8.txt
   0.778 GB/s (1.6 %)    0.261 Gc/s     2.98 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 65542, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Emoji-Lipsum.utf8.txt
   0.985 GB/s (1.7 %)    0.246 Gc/s     4.00 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 66495, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Hebrew-Lipsum.utf8.txt
   0.646 GB/s (3.3 %)    0.362 Gc/s     1.78 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 87997, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Hindi-Lipsum.utf8.txt
   0.632 GB/s (1.9 %)    0.235 Gc/s     2.69 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 67808, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Japanese-Lipsum.utf8.txt
   0.748 GB/s (6.2 %)    0.258 Gc/s     2.90 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 66600, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Korean-Lipsum.utf8.txt
   0.780 GB/s (4.5 %)    0.318 Gc/s     2.45 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 86940, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Latin-Lipsum.utf8.txt
  16.576 GB/s (7.9 %)   16.576 Gc/s     1.00 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 104770, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Russian-Lipsum.utf8.txt
   0.591 GB/s (2.4 %)    0.327 Gc/s     1.81 byte/char

This PR

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/unicode_lipsum/lipsum/*.utf8.txt
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 9
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 81685, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Arabic-Lipsum.utf8.txt
   0.642 GB/s (3.4 %)    0.360 Gc/s     1.78 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 69840, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Chinese-Lipsum.utf8.txt
   1.256 GB/s (2.3 %)    0.422 Gc/s     2.98 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 65542, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Emoji-Lipsum.utf8.txt
   1.196 GB/s (1.4 %)    0.299 Gc/s     4.00 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 66495, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Hebrew-Lipsum.utf8.txt
   0.626 GB/s (3.0 %)    0.351 Gc/s     1.78 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 87997, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Hindi-Lipsum.utf8.txt
   0.928 GB/s (2.3 %)    0.345 Gc/s     2.69 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 67808, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Japanese-Lipsum.utf8.txt
   1.198 GB/s (1.6 %)    0.413 Gc/s     2.90 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 66600, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Korean-Lipsum.utf8.txt
   1.150 GB/s (3.0 %)    0.469 Gc/s     2.45 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 86940, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Latin-Lipsum.utf8.txt
  22.083 GB/s (1.1 %)   22.083 Gc/s     1.00 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 104770, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Russian-Lipsum.utf8.txt
   0.641 GB/s (2.4 %)    0.355 Gc/s     1.81 byte/char

Member

lemire commented Aug 12, 2021

Thanks. It does look convincing. Here are my own results (Apple M1, LLVM 12):

file                        main branch    PR
Arabic-Lipsum.utf8.txt      1.476 GB/s     1.634 GB/s
Chinese-Lipsum.utf8.txt     1.700 GB/s     1.789 GB/s
Emoji-Lipsum.utf8.txt       2.017 GB/s     1.921 GB/s
Hebrew-Lipsum.utf8.txt      1.446 GB/s     1.656 GB/s
Hebrew-Lipsum.utf8.txt      1.237 GB/s     1.277 GB/s
Japanese-Lipsum.utf8.txt    1.569 GB/s     1.603 GB/s
Korean-Lipsum.utf8.txt      1.700 GB/s     1.867 GB/s
Latin-Lipsum.utf8.txt       17.247 GB/s    25.451 GB/s
Russian-Lipsum.utf8.txt     0.975 GB/s     0.975 GB/s

It is a bit surprising at first that it would help even with pure ASCII files, but it makes sense.

Merging.

@lemire lemire merged commit 1c90dd9 into simdutf:master Aug 12, 2021
@lpinca lpinca deleted the skip/ascii-check branch August 13, 2021 04:53