Benchmark results

Naoki Shibata edited this page Oct 28, 2020 · 14 revisions

Below are benchmark results on a few processors.

Sqrt, div, combination 1, and combination 2 correspond to the following four functions. The argument types, the sqrt function, and the return type are changed to match the indicated type. See tester/microbench.cpp for the source code.

bool func0(double w, double x, double y, double z) { // sqrt
  return w * sqrt(x) + y > z;
}

bool func1(double a, double b, double c, double d) { // div
  return a / b + c > d;
}

bool func2(double a, double b, double c, double d) { // combination 1
  return a / b + sqrt(c + 1.1) < d / (a + 1.2) + (b + 1.3) / c;
}

bool func3(double a0, double a1, double a2, double a3) { // combination 2
  return sqrt(a3 / sqrt(a2 / sqrt(a1 / sqrt(1.1 / a0)))) < 1.1;
}
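
As a rough illustration of how the element type can vary, the sqrt case might be written generically as below. This is only a sketch: the name `func0_generic` is hypothetical, and the actual code in tester/microbench.cpp may be organized differently (for example, the SIMD variants return one result per lane rather than a single bool).

```cpp
#include <cmath>

// Hypothetical generic form of func0; not taken from tester/microbench.cpp.
// std::sqrt resolves to the float or double overload according to T.
template <typename T>
bool func0_generic(T w, T x, T y, T z) {
  return w * std::sqrt(x) + y > z;
}
```

Calling `func0_generic<float>(...)` gives the "Scalar float sqrt" shape, and `func0_generic<double>(...)` the "Scalar double sqrt" shape.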

The microbenchmarking code is built with the following options. The first command builds without the plugin, and the second loads it (libMathPeephole.so). Compilation is carried out on the target computer.

clang-10 -O3 -march=native -ffast-math -S example.c
clang-10 -Xclang -load -Xclang libMathPeephole.so -O3 -march=native -ffast-math -S example.c

Latency w/o and Throughput w/o are the results without the transform plugin. Latency w/ and Throughput w/ are the results with the plugin.

Latency is the time, in nanoseconds, that a function needs to finish computing its return value.

Throughput is the reciprocal throughput of each function in nanoseconds, which means that the function can be executed once per the indicated time period.
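
The difference between the two metrics can be sketched as follows. This is a minimal illustration, not the method used in tester/microbench.cpp: the names `latency_ns`, `reciprocal_throughput_ns`, `g_b`, `g_c`, `g_d`, and `g_sink` are hypothetical, and a real benchmark needs more care (warm-up, repetition, checking the generated assembly). In the latency loop every call needs the previous result, so calls cannot overlap. In the throughput loop the calls are independent, so the processor may overlap them.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// "div" function from the list above.
static bool func1(double a, double b, double c, double d) {
  return a / b + c > d;
}

// Inputs are read through volatile so the compiler cannot constant-fold them.
static volatile double g_b = 3.0, g_c = 0.25, g_d = 0.5;
static volatile double g_sink = 0.0;  // keeps results observable

// Latency: each call's input depends on the previous call's result.
double latency_ns(std::size_t iters) {
  const double b = g_b, c = g_c, d = g_d;
  double a = 1.0;
  const auto t0 = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iters; ++i) {
    a += func1(a, b, c, d) ? 1.0 : 0.5;  // dependency chain through a
  }
  const auto t1 = std::chrono::steady_clock::now();
  g_sink = a;
  return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

// Reciprocal throughput: calls are independent of each other.
double reciprocal_throughput_ns(std::size_t iters) {
  const double b = g_b, c = g_c, d = g_d;
  constexpr std::size_t N = 1024;  // power of two, so (i & (N - 1)) wraps the index
  std::vector<double> in(N);
  for (std::size_t i = 0; i < N; ++i) in[i] = 1.0 + 0.001 * i;
  std::size_t hits = 0;
  const auto t0 = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iters; ++i) {
    hits += func1(in[i & (N - 1)], b, c, d);  // no dependency between calls
  }
  const auto t1 = std::chrono::steady_clock::now();
  g_sink = static_cast<double>(hits);
  return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}
```

Both functions return the average time per call in nanoseconds for a nonzero `iters`.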

One execution of the SIMD version of a function returns multiple values at a time.
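
To compare a SIMD row with a scalar row, it can help to convert a per-call figure into a per-value figure. The helper below is a sketch that assumes the tabulated numbers are per call, as the sentence above suggests; `per_value_ns` is a hypothetical name and does not appear in the benchmark source.

```cpp
// Time per returned value, assuming the tabulated figure is per call
// and that `lanes` values are produced by each call (for example, 4 for V4).
constexpr double per_value_ns(double per_call_ns, int lanes) {
  return per_call_ns / lanes;
}
```

For example, `per_value_ns(0.737157, 4)` divides the V4 float div throughput measured with the plugin on the Ryzen 9 3900X by its four lanes.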

AMD Ryzen 9 3900X (on VirtualBox)

Smaller values are better.

| Function | Latency w/o (ns) | Latency w/ (ns) | Throughput w/o (ns) | Throughput w/ (ns) |
|---|---|---|---|---|
| Scalar double sqrt | 10.8974 | 7.07204 | 2.37942 | 1.63928 |
| Scalar double div | 8.16066 | 6.25515 | 1.36006 | 1.29232 |
| Scalar double combination 1 | 12.7868 | 11.1754 | 5.1786 | 2.92174 |
| Scalar double combination 2 | 39.46 | 9.25341 | 14.995 | 2.29033 |
| Scalar float sqrt | 10.328 | 7.06584 | 1.54236 | 1.7094 |
| Scalar float div | 7.34748 | 6.25499 | 0.95567 | 1.207 |
| Scalar float combination 1 | 12.1938 | 11.1764 | 2.97841 | 2.92405 |
| Scalar float combination 2 | 27.4484 | 9.25973 | 9.90441 | 2.28873 |
| V2 double sqrt | 8.41558 | 5.02199 | 2.44916 | 1.60365 |
| V2 double div | 8.41483 | 5.03158 | 1.36246 | 1.10354 |
| V2 double combination 1 | 11.2803 | 8.77854 | 6.66632 | 2.53325 |
| V2 double combination 2 | 37.7507 | 7.6074 | 15.1751 | 2.66222 |
| V4 float sqrt | 7.87997 | 4.97866 | 1.56829 | 1.59725 |
| V4 float div | 7.87941 | 4.98164 | 1.22538 | 0.737157 |
| V4 float combination 1 | 9.53807 | 8.28124 | 3.8414 | 2.65078 |
| V4 float combination 2 | 25.5575 | 7.61317 | 11.4465 | 2.66919 |
| V4 double sqrt | 8.4247 | 7.31245 | 2.53925 | 2.55821 |
| V4 double div | 6.06497 | 3.71661 | 2.18469 | 2.19789 |
| V4 double combination 1 | 11.3289 | 8.75939 | 6.69132 | 2.55606 |
| V4 double combination 2 | 37.7666 | 10.5493 | 15.2794 | 4.25895 |
| V8 float sqrt | 8.13279 | 7.31029 | 1.62507 | 2.57191 |
| V8 float div | 6.80213 | 3.74614 | 1.32429 | 0.863688 |
| V8 float combination 1 | 9.25603 | 8.22829 | 3.84619 | 2.70795 |
| V8 float combination 2 | 25.7601 | 10.6259 | 11.436 | 4.32672 |

Intel Core i5-8400

| Function | Latency w/o (ns) | Latency w/ (ns) | Throughput w/o (ns) | Throughput w/ (ns) |
|---|---|---|---|---|
| Scalar double sqrt | 9.83333 | 6.94472 | 1.58385 | 2.10879 |
| Scalar double div | 8.29314 | 6.69722 | 1.43114 | 1.76826 |
| Scalar double combination 1 | 11.7486 | 11.6426 | 3.80272 | 4.25498 |
| Scalar double combination 2 | 35.428 | 9.25952 | 10.0344 | 3.36071 |
| Scalar float sqrt | 7.0636 | 5.88225 | 1.76003 | 1.70523 |
| Scalar float div | 6.79956 | 5.74564 | 1.03727 | 1.39237 |
| Scalar float combination 1 | 8.85469 | 11.1432 | 2.51212 | 4.03364 |
| Scalar float combination 2 | 23.8257 | 9.06618 | 5.50221 | 2.92968 |
| V2 double sqrt | 8.40408 | 6.99574 | 1.5047 | 2.17567 |
| V2 double div | 8.40506 | 6.99687 | 1.07378 | 1.09575 |
| V2 double combination 1 | 10.3536 | 9.56395 | 4.76477 | 2.25697 |
| V2 double combination 2 | 34.9246 | 10.4645 | 10.033 | 3.09594 |
| V4 float sqrt | 6.90264 | 6.62587 | 0.86425 | 1.73259 |
| V4 float div | 6.90431 | 6.62504 | 0.931655 | 1.27627 |
| V4 float combination 1 | 8.50001 | 8.43789 | 2.65377 | 1.80429 |
| V4 float combination 2 | 21.7596 | 10.3026 | 5.35041 | 2.72136 |
| V4 double sqrt | 8.40729 | 8.70547 | 3.00989 | 3.10562 |
| V4 double div | 7.45602 | 5.82322 | 2.00891 | 1.59794 |
| V4 double combination 1 | 14.1625 | 9.55264 | 9.02738 | 3.01147 |
| V4 double combination 2 | 34.9217 | 12.1894 | 20.0653 | 4.55949 |
| V8 float sqrt | 6.91123 | 8.6802 | 1.58282 | 2.63874 |
| V8 float div | 8.19075 | 5.80181 | 1.30103 | 0.800619 |
| V8 float combination 1 | 8.49971 | 8.43436 | 2.66911 | 1.86872 |
| V8 float combination 2 | 21.7696 | 12.1922 | 5.42351 | 4.32093 |

Intel Pentium J5005

| Function | Latency w/o (ns) | Latency w/ (ns) | Throughput w/o (ns) | Throughput w/ (ns) |
|---|---|---|---|---|
| Scalar double sqrt | 16.5441 | 14.023 | 5.05278 | 5.83582 |
| Scalar double div | 14.0182 | 11.8617 | 3.73546 | 4.57744 |
| Scalar double combination 1 | 26.9567 | 21.208 | 15.8273 | 9.72597 |
| Scalar double combination 2 | 53.7198 | 16.8929 | 34.5079 | 7.56289 |
| Scalar float sqrt | 16.5306 | 12.7593 | 4.67547 | 5.68942 |
| Scalar float div | 11.5015 | 10.431 | 3.32477 | 4.41451 |
| Scalar float combination 1 | 19.7651 | 19.5915 | 9.72312 | 8.86888 |
| Scalar float combination 2 | 42.7687 | 15.452 | 19.0683 | 6.80609 |
| V2 double sqrt | 18.3472 | 12.1431 | 9.37445 | 6.59464 |
| V2 double div | 18.3512 | 12.142 | 6.49753 | 3.8851 |
| V2 double combination 1 | 36.3182 | 18.345 | 28.7839 | 9.3791 |
| V2 double combination 2 | 79.7839 | 14.7554 | 63.2677 | 8.67484 |
| V4 float sqrt | 14.0528 | 10.1099 | 5.12676 | 6.42582 |
| V4 float div | 14.0522 | 10.1122 | 4.36621 | 3.90253 |
| V4 float combination 1 | 16.9448 | 16.5692 | 11.4089 | 8.74964 |
| V4 float combination 2 | 40.9969 | 14.4236 | 24.6578 | 8.76855 |

Discussion

On x86 architectures, reciprocal approximation instructions are used for single-precision division when the -ffast-math option is given. Because of this, single-precision results do not improve as much as double-precision results.

Ryzen 9 3900X and Core i5-8400 have fully pipelined and vectorized division and sqrt units, while those units are not fully vectorized on Pentium J5005. Division and sqrt are very fast on Core i5-8400, but the proposed plugin still improves latency noticeably. On the other two processors, the plugin clearly improves latency, and it improves throughput in most cases.
