Skip ASCII check until the non ASCII byte is found #81

lpinca · 2021-08-12T07:39:21Z

If the 8 + 8 bytes ASCII check is unsuccessful, then there is no reason to repeat it until the non-ASCII byte is found.
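
The idea described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual diff of this PR: the names `is_ascii16` and `validate_utf8_sketch` are hypothetical, and the structure of the scalar fallback is assumed. The point it shows is that once the 8 + 8 byte check fails, the ASCII bytes before the offending byte are consumed one at a time, so the 16-byte check is not run again for every one of them.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true when none of the 16 bytes at p has its high bit set.
// Two 8-byte loads are OR-ed together, which is the "8 + 8 bytes" check.
static bool is_ascii16(const unsigned char* p) {
    std::uint64_t v1, v2;
    std::memcpy(&v1, p, sizeof v1);
    std::memcpy(&v2, p + 8, sizeof v2);
    return ((v1 | v2) & 0x8080808080808080ULL) == 0;
}

static bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Hypothetical scalar validator illustrating the optimization.
bool validate_utf8_sketch(const char* buf, std::size_t len) {
    const unsigned char* data = reinterpret_cast<const unsigned char*>(buf);
    std::size_t pos = 0;
    while (pos < len) {
        // Fast path: skip 16 bytes at once when they are all ASCII.
        if (pos + 16 <= len && is_ascii16(data + pos)) {
            pos += 16;
            continue;
        }
        // The check failed (or fewer than 16 bytes remain). Walk over the
        // leading ASCII bytes without repeating the 16-byte check.
        unsigned char byte = data[pos];
        while (byte < 0x80) {
            if (++pos == len) return true;
            byte = data[pos];
        }
        // byte is now the first non-ASCII byte: validate one code point.
        if ((byte & 0xE0) == 0xC0) {  // 2-byte sequence
            if (pos + 2 > len || !is_continuation(data[pos + 1])) return false;
            const std::uint32_t cp =
                (std::uint32_t(byte & 0x1F) << 6) | (data[pos + 1] & 0x3F);
            if (cp < 0x80) return false;  // overlong encoding
            pos += 2;
        } else if ((byte & 0xF0) == 0xE0) {  // 3-byte sequence
            if (pos + 3 > len || !is_continuation(data[pos + 1]) ||
                !is_continuation(data[pos + 2])) return false;
            const std::uint32_t cp = (std::uint32_t(byte & 0x0F) << 12) |
                                     (std::uint32_t(data[pos + 1] & 0x3F) << 6) |
                                     (data[pos + 2] & 0x3F);
            if (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
            pos += 3;
        } else if ((byte & 0xF8) == 0xF0) {  // 4-byte sequence
            if (pos + 4 > len || !is_continuation(data[pos + 1]) ||
                !is_continuation(data[pos + 2]) ||
                !is_continuation(data[pos + 3])) return false;
            const std::uint32_t cp = (std::uint32_t(byte & 0x07) << 18) |
                                     (std::uint32_t(data[pos + 1] & 0x3F) << 12) |
                                     (std::uint32_t(data[pos + 2] & 0x3F) << 6) |
                                     (data[pos + 3] & 0x3F);
            if (cp < 0x10000 || cp > 0x10FFFF) return false;
            pos += 4;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
    }
    return true;
}
```

Without the inner `while` loop, a block with a single non-ASCII byte near its end would fail the 16-byte check once for every ASCII byte that precedes it.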

lemire · 2021-08-12T14:56:49Z

What are your benchmark numbers supporting this optimization?

By how much do you speed up the processing... please include diverse data sources...

lpinca · 2021-08-12T17:20:41Z

Master branch

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/data/data.txt 
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 1
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 10485760, iterations: 2000, dataset: /Users/luigi/data/data.txt
   1.064 GB/s (0.6 %)    0.997 Gc/s     1.07 byte/char

This PR

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/data/data.txt 
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 1
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 10485760, iterations: 2000, dataset: /Users/luigi/data/data.txt
   2.542 GB/s (1.5 %)    2.383 Gc/s     1.07 byte/char

The dataset is a 10 MiB file made of aaaaaaaaaaaaaa© chunks.
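
For anyone who wants to reproduce a similar input, here is a minimal sketch of how such a file's contents could be built. It assumes each chunk is 14 ASCII `a` bytes followed by the 2-byte UTF-8 encoding of `©` (16 bytes, 15 code points), which is consistent with the 1.07 byte/char reported above. The function name `make_synthetic_dataset` is hypothetical and is not part of the benchmark tool.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: builds a buffer of repeated "aaaaaaaaaaaaaa©" chunks.
// Assumes a chunk of 14 'a' bytes plus U+00A9 encoded as 0xC2 0xA9.
std::string make_synthetic_dataset(std::size_t total_bytes) {
    const std::string chunk = std::string(14, 'a') + "\xC2\xA9";
    std::string out;
    out.reserve(total_bytes);
    // Append whole chunks only, so the buffer never ends mid code point.
    while (out.size() + chunk.size() <= total_bytes) {
        out += chunk;
    }
    return out;
}
```

Calling `make_synthetic_dataset(10485760)` and writing the result to a file in binary mode gives a 10 MiB input in which every 16-byte block contains exactly one non-ASCII character, the case this change targets most directly.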

Will run more benchmarks with real world dataset later.

lemire · 2021-08-12T17:27:14Z

I am concerned about such a synthetic dataset.

Would you try again with the files from this repository?

https://github.com/lemire/unicode_lipsum

lpinca · 2021-08-12T17:30:54Z

Wikipedia Japanese main page (https://ja.wikipedia.org/wiki/メインページ):

Master branch

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/data/wikipedia.ja.html 
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 1
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 98576, iterations: 2000, dataset: /Users/luigi/data/wikipedia.ja.html
   2.204 GB/s (1.9 %)    1.938 Gc/s     1.14 byte/char

This PR

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/data/wikipedia.ja.html 
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 1
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 98576, iterations: 2000, dataset: /Users/luigi/data/wikipedia.ja.html
   3.377 GB/s (2.6 %)    2.969 Gc/s     1.14 byte/char

lpinca · 2021-08-12T17:32:49Z

> Would you try again with the files from this repository?

All of them?

lpinca · 2021-08-12T17:43:46Z

From a quick look, almost all files in that repository, with the exception of Russian-Lipsum.utf8.txt and Latin-Lipsum.utf8.txt, are made exclusively of multi-byte characters, so it doesn't make much sense, but will do.

Nvm, I did not see the wikipedia_mars folder. Will use the files in that folder.

lpinca · 2021-08-12T18:10:49Z

Master branch

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/unicode_lipsum/wikipedia_mars/*.html
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 18
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 954430, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/arabic.html
   1.588 GB/s (2.1 %)    1.388 Gc/s     1.14 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 382079, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/chinese.html
   2.031 GB/s (2.3 %)    1.787 Gc/s     1.14 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 368442, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/czech.html
   2.346 GB/s (2.7 %)    2.287 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 1005060, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/english.html
  12.123 GB/s (2.2 %)   12.088 Gc/s     1.00 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 192461, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/esperanto.html
   4.831 GB/s (3.6 %)    4.757 Gc/s     1.02 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 1032638, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/french.html
   3.584 GB/s (2.9 %)    3.539 Gc/s     1.01 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 397376, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/german.html
   4.850 GB/s (2.9 %)    4.793 Gc/s     1.01 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 326722, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/greek.html
   1.693 GB/s (2.7 %)    1.482 Gc/s     1.14 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 327412, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/hebrew.html
   1.320 GB/s (2.4 %)    1.136 Gc/s     1.16 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 712465, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/hindi.html
   1.585 GB/s (1.9 %)    1.301 Gc/s     1.22 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 304786, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/japanese.html
   1.790 GB/s (2.3 %)    1.509 Gc/s     1.19 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 193001, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/korean.html
   1.559 GB/s (2.4 %)    1.346 Gc/s     1.16 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 293677, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/persan.html
   1.608 GB/s (2.8 %)    1.425 Gc/s     1.13 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 692409, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/portuguese.html
   4.270 GB/s (2.6 %)    4.225 Gc/s     1.01 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 713817, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/russian.html
   1.421 GB/s (2.2 %)    1.217 Gc/s     1.17 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 1088085, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/thai.html
   1.972 GB/s (2.0 %)    1.618 Gc/s     1.22 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 387007, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/turkish.html
   2.319 GB/s (2.8 %)    2.259 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 674255, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/vietnamese.html
   1.614 GB/s (2.2 %)    1.523 Gc/s     1.06 byte/char

This PR

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/unicode_lipsum/wikipedia_mars/*.html
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 18
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 954430, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/arabic.html
   1.763 GB/s (2.4 %)    1.541 Gc/s     1.14 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 382079, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/chinese.html
   2.908 GB/s (2.7 %)    2.559 Gc/s     1.14 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 368442, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/czech.html
   2.869 GB/s (5.5 %)    2.797 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 1005060, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/english.html
  16.369 GB/s (3.1 %)   16.322 Gc/s     1.00 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 192461, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/esperanto.html
   6.678 GB/s (3.3 %)    6.575 Gc/s     1.02 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 1032638, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/french.html
   4.593 GB/s (3.4 %)    4.535 Gc/s     1.01 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 397376, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/german.html
   6.486 GB/s (2.6 %)    6.411 Gc/s     1.01 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 326722, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/greek.html
   1.907 GB/s (2.9 %)    1.670 Gc/s     1.14 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 327412, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/hebrew.html
   1.457 GB/s (2.8 %)    1.254 Gc/s     1.16 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 712465, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/hindi.html
   2.229 GB/s (2.4 %)    1.830 Gc/s     1.22 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 304786, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/japanese.html
   2.589 GB/s (2.7 %)    2.183 Gc/s     1.19 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 193001, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/korean.html
   2.118 GB/s (3.4 %)    1.829 Gc/s     1.16 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 293677, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/persan.html
   1.803 GB/s (3.0 %)    1.598 Gc/s     1.13 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 692409, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/portuguese.html
   5.480 GB/s (3.1 %)    5.422 Gc/s     1.01 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 713817, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/russian.html
   1.612 GB/s (3.4 %)    1.380 Gc/s     1.17 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 1088085, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/thai.html
   2.971 GB/s (2.1 %)    2.437 Gc/s     1.22 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 387007, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/turkish.html
   2.819 GB/s (5.3 %)    2.746 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 674255, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/vietnamese.html
   2.034 GB/s (4.0 %)    1.920 Gc/s     1.06 byte/char

lpinca · 2021-08-12T18:18:26Z

Master branch

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/unicode_lipsum/wikipedia_mars/*.utf8.txt 
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 18
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 533857, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/arabic.utf8.txt
   1.020 GB/s (2.2 %)    0.812 Gc/s     1.26 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 181321, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/chinese.utf8.txt
   1.114 GB/s (2.7 %)    0.843 Gc/s     1.32 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 152721, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/czech.utf8.txt
   1.141 GB/s (3.3 %)    1.074 Gc/s     1.06 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 390368, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/english.utf8.txt
   8.678 GB/s (3.1 %)    8.614 Gc/s     1.01 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 86963, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/esperanto.utf8.txt
   2.886 GB/s (3.5 %)    2.792 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 446908, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/french.utf8.txt
   1.926 GB/s (2.4 %)    1.874 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 205779, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/german.utf8.txt
   2.982 GB/s (2.5 %)    2.916 Gc/s     1.02 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 181348, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/greek.utf8.txt
   1.032 GB/s (2.7 %)    0.814 Gc/s     1.27 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 190114, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/hebrew.utf8.txt
   0.841 GB/s (2.2 %)    0.648 Gc/s     1.30 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 396593, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/hindi.utf8.txt
   0.971 GB/s (2.0 %)    0.671 Gc/s     1.45 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 164355, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/japanese.utf8.txt
   1.078 GB/s (2.3 %)    0.780 Gc/s     1.38 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 97859, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/korean.utf8.txt
   0.889 GB/s (2.6 %)    0.662 Gc/s     1.34 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 156209, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/persan.utf8.txt
   0.973 GB/s (2.5 %)    0.777 Gc/s     1.25 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 280660, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/portuguese.utf8.txt
   2.189 GB/s (3.2 %)    2.134 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 407095, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/russian.utf8.txt
   0.913 GB/s (2.3 %)    0.700 Gc/s     1.30 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 593589, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/thai.utf8.txt
   1.199 GB/s (1.8 %)    0.818 Gc/s     1.47 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 195078, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/turkish.utf8.txt
   1.333 GB/s (2.8 %)    1.267 Gc/s     1.05 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 319029, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/vietnamese.utf8.txt
   0.837 GB/s (2.2 %)    0.741 Gc/s     1.13 byte/char

This PR

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/unicode_lipsum/wikipedia_mars/*.utf8.txt
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 18
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 533857, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/arabic.utf8.txt
   1.111 GB/s (2.4 %)    0.884 Gc/s     1.26 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 181321, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/chinese.utf8.txt
   1.602 GB/s (2.5 %)    1.212 Gc/s     1.32 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 152721, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/czech.utf8.txt
   1.383 GB/s (5.6 %)    1.303 Gc/s     1.06 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 390368, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/english.utf8.txt
  12.315 GB/s (3.9 %)   12.225 Gc/s     1.01 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 86963, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/esperanto.utf8.txt
   4.225 GB/s (4.8 %)    4.088 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 446908, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/french.utf8.txt
   2.449 GB/s (4.1 %)    2.383 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 205779, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/german.utf8.txt
   4.003 GB/s (3.3 %)    3.914 Gc/s     1.02 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 181348, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/greek.utf8.txt
   1.158 GB/s (3.0 %)    0.913 Gc/s     1.27 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 190114, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/hebrew.utf8.txt
   0.921 GB/s (2.7 %)    0.709 Gc/s     1.30 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 396593, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/hindi.utf8.txt
   1.363 GB/s (2.1 %)    0.941 Gc/s     1.45 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 164355, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/japanese.utf8.txt
   1.582 GB/s (2.6 %)    1.145 Gc/s     1.38 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 97859, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/korean.utf8.txt
   1.226 GB/s (3.1 %)    0.914 Gc/s     1.34 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 156209, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/persan.utf8.txt
   1.074 GB/s (2.7 %)    0.857 Gc/s     1.25 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 280660, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/portuguese.utf8.txt
   2.763 GB/s (3.4 %)    2.694 Gc/s     1.03 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 407095, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/russian.utf8.txt
   1.023 GB/s (3.0 %)    0.784 Gc/s     1.30 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 593589, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/thai.utf8.txt
   1.831 GB/s (1.9 %)    1.249 Gc/s     1.47 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 195078, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/turkish.utf8.txt
   1.624 GB/s (6.0 %)    1.544 Gc/s     1.05 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 319029, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/wikipedia_mars/vietnamese.utf8.txt
   1.058 GB/s (4.6 %)    0.936 Gc/s     1.13 byte/char

lpinca · 2021-08-12T18:31:31Z

Master branch

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/unicode_lipsum/lipsum/*.utf8.txt
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 9
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 81685, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Arabic-Lipsum.utf8.txt
   0.657 GB/s (3.0 %)    0.368 Gc/s     1.78 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 69840, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Chinese-Lipsum.utf8.txt
   0.778 GB/s (1.6 %)    0.261 Gc/s     2.98 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 65542, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Emoji-Lipsum.utf8.txt
   0.985 GB/s (1.7 %)    0.246 Gc/s     4.00 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 66495, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Hebrew-Lipsum.utf8.txt
   0.646 GB/s (3.3 %)    0.362 Gc/s     1.78 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 87997, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Hindi-Lipsum.utf8.txt
   0.632 GB/s (1.9 %)    0.235 Gc/s     2.69 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 67808, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Japanese-Lipsum.utf8.txt
   0.748 GB/s (6.2 %)    0.258 Gc/s     2.90 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 66600, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Korean-Lipsum.utf8.txt
   0.780 GB/s (4.5 %)    0.318 Gc/s     2.45 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 86940, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Latin-Lipsum.utf8.txt
  16.576 GB/s (7.9 %)   16.576 Gc/s     1.00 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 104770, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Russian-Lipsum.utf8.txt
   0.591 GB/s (2.4 %)    0.327 Gc/s     1.81 byte/char

This PR

$ ./build/benchmarks/benchmark -P validate_utf8+fallback -F ~/unicode_lipsum/lipsum/*.utf8.txt
We define the number of bytes to be the number of *input* bytes.
We define a 'char' to be a code point (between 1 and 4 bytes).
===========================
testcases: 9
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 81685, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Arabic-Lipsum.utf8.txt
   0.642 GB/s (3.4 %)    0.360 Gc/s     1.78 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 69840, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Chinese-Lipsum.utf8.txt
   1.256 GB/s (2.3 %)    0.422 Gc/s     2.98 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 65542, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Emoji-Lipsum.utf8.txt
   1.196 GB/s (1.4 %)    0.299 Gc/s     4.00 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 66495, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Hebrew-Lipsum.utf8.txt
   0.626 GB/s (3.0 %)    0.351 Gc/s     1.78 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 87997, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Hindi-Lipsum.utf8.txt
   0.928 GB/s (2.3 %)    0.345 Gc/s     2.69 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 67808, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Japanese-Lipsum.utf8.txt
   1.198 GB/s (1.6 %)    0.413 Gc/s     2.90 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 66600, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Korean-Lipsum.utf8.txt
   1.150 GB/s (3.0 %)    0.469 Gc/s     2.45 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 86940, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Latin-Lipsum.utf8.txt
  22.083 GB/s (1.1 %)   22.083 Gc/s     1.00 byte/char 
input detected as UTF8
current system detected as haswell
===========================
validate_utf8+fallback, input size: 104770, iterations: 2000, dataset: /Users/luigi/unicode_lipsum/lipsum/Russian-Lipsum.utf8.txt
   0.641 GB/s (2.4 %)    0.355 Gc/s     1.81 byte/char

lemire · 2021-08-12T21:52:53Z

Thanks. It does look convincing. Here are my own results (Apple M1, LLVM 12):

file	main branch	PR
Arabic-Lipsum.utf8.txt	1.476 GB/s	1.634 GB/s
Chinese-Lipsum.utf8.txt	1.700 GB/s	1.789 GB/s
Emoji-Lipsum.utf8.txt	2.017 GB/s	1.921 GB/s
Hebrew-Lipsum.utf8.txt	1.446 GB/s	1.656 GB/s
Hindi-Lipsum.utf8.txt	1.237 GB/s	1.277 GB/s
Japanese-Lipsum.utf8.txt	1.569 GB/s	1.603 GB/s
Korean-Lipsum.utf8.txt	1.700 GB/s	1.867 GB/s
Latin-Lipsum.utf8.txt	17.247 GB/s	25.451 GB/s
Russian-Lipsum.utf8.txt	0.975 GB/s	0.975 GB/s

It is a bit surprising at first that it would help even with pure ASCII files, but it makes sense.
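
To read the table as relative changes, a small sketch like the following may help. The struct `Row`, the array `kRows`, and the function `relative_change_percent` are hypothetical names introduced here for illustration. Only three rows are copied from the table above: the largest gain, the one regression, and the unchanged case.

```cpp
#include <cstddef>

// Hypothetical record for one row of the table above (throughput in GB/s).
struct Row {
    const char* file;
    double main_branch;
    double pr;
};

// Values copied from the Apple M1 results above.
static const Row kRows[] = {
    {"Latin-Lipsum.utf8.txt", 17.247, 25.451},
    {"Emoji-Lipsum.utf8.txt", 2.017, 1.921},
    {"Russian-Lipsum.utf8.txt", 0.975, 0.975},
};
static const std::size_t kRowCount = sizeof kRows / sizeof kRows[0];

// Percentage change in throughput going from the main branch to the PR.
// Positive means the PR is faster, negative means it is slower.
double relative_change_percent(const Row& row) {
    return (row.pr - row.main_branch) / row.main_branch * 100.0;
}
```

Applied to these rows, the function gives a positive value for the Latin file, a small negative value for the Emoji file, and zero for the Russian file.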

Merging.

Skip ASCII check until the non ASCII byte is found

0cbb098

If the 8 + 8 bytes ASCII check is unsuccessful, then there is no reason to repeat it until the non-ASCII byte is found.

lpinca force-pushed the skip/ascii-check branch from 918186d to 0cbb098 Compare August 12, 2021 11:54

lemire merged commit 1c90dd9 into simdutf:master Aug 12, 2021

lpinca deleted the skip/ascii-check branch August 13, 2021 04:53

lpinca mentioned this pull request Sep 3, 2024

benchmark: add isUtf8 and isAscii bench nodejs/node#54740

Closed
