Benchmark results
Below are benchmark results on a few processors.
The labels sqrt, div, combination 1 and combination 2 correspond to the following four functions. The argument types, the sqrt function and the return type are changed to match the indicated type. See tester/microbench.cpp for the source code.
```cpp
bool func0(double w, double x, double y, double z) { // sqrt
    return w * sqrt(x) + y > z;
}
bool func1(double a, double b, double c, double d) { // div
    return a / b + c > d;
}
bool func2(double a, double b, double c, double d) { // combination 1
    return a / b + sqrt(c + 1.1) < d / (a + 1.2) + (b + 1.3) / c;
}
bool func3(double a0, double a1, double a2, double a3) { // combination 2
    return sqrt(a3 / sqrt(a2 / sqrt(a1 / sqrt(1.1 / a0)))) < 1.1;
}
```
The microbenchmark code is built with the following options, first without and then with the transform plugin. Compilation is carried out on the target computer.
```
clang-10 -O3 -march=native -ffast-math -S example.c
clang-10 -Xclang -load -Xclang libMathPeephole.so -O3 -march=native -ffast-math -S example.c
```
"Latency w/o" and "Throughput w/o" are the results without the transform plugin. "Latency w/" and "Throughput w/" are the results with the plugin.
Latency is the time in nanoseconds that one call of the function takes from start until its return value is available.
Throughput is the reciprocal throughput of the function in nanoseconds, meaning that one call can be completed per indicated time period when calls are issued back to back.
One execution of a SIMD version of a function returns multiple values at a time.
For both metrics, smaller values are better.
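The difference between the two metrics can be illustrated with a minimal timing sketch. This is not the harness in tester/microbench.cpp; the names `Timing`, `measure_latency` and `measure_throughput` are hypothetical, and only a scalar double copy of `func0` is exercised. The assumption is that latency is observed when each call's input depends on the previous result, while reciprocal throughput is observed when calls are independent and can overlap.

```cpp
#include <chrono>
#include <cmath>

// Scalar double "sqrt" kernel, copied from the function list on this page.
static bool func0(double w, double x, double y, double z) {
    return w * std::sqrt(x) + y > z;
}

// Hypothetical result type: average time per call plus a count of true
// results, which keeps the computed values observable.
struct Timing {
    double ns_per_call;
    long true_count;
};

// Latency: the next input is computed from the previous result, so
// consecutive calls form a dependency chain and cannot overlap.
Timing measure_latency(int iters) {
    volatile double seed = 4.0;  // volatile read blocks constant folding
    double x = seed;
    long trues = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        bool r = func0(1.0, x, 0.0, 2.5);        // is sqrt(x) > 2.5 ?
        trues += r;
        x = 9.0 - 5.0 * static_cast<double>(r);  // x alternates between 9 and 4
    }
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return {ns / iters, trues};
}

// Reciprocal throughput: inputs do not depend on earlier results, so the
// processor is free to overlap consecutive calls.
Timing measure_throughput(int iters) {
    volatile double seed = 4.0;
    const double inputs[2] = {seed, seed + 5.0};  // 4.0 and 9.0
    long trues = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        trues += func0(1.0, inputs[i & 1], 0.0, 2.5);
    }
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return {ns / iters, trues};
}
```

Calling `measure_latency(1000000).ns_per_call` and `measure_throughput(1000000).ns_per_call` gives two per-call times; on a pipelined processor the second would be expected to be the smaller one. A real harness would also need warm-up runs and protection against the optimizer removing or reordering the work.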
| | Latency w/o | Latency w/ | Throughput w/o | Throughput w/ |
|---|---|---|---|---|
| Scalar double sqrt | 10.8974 | 7.07204 | 2.37942 | 1.63928 |
| Scalar double div | 8.16066 | 6.25515 | 1.36006 | 1.29232 |
| Scalar double combination 1 | 12.7868 | 11.1754 | 5.1786 | 2.92174 |
| Scalar double combination 2 | 39.46 | 9.25341 | 14.995 | 2.29033 |
| Scalar float sqrt | 10.328 | 7.06584 | 1.54236 | 1.7094 |
| Scalar float div | 7.34748 | 6.25499 | 0.95567 | 1.207 |
| Scalar float combination 1 | 12.1938 | 11.1764 | 2.97841 | 2.92405 |
| Scalar float combination 2 | 27.4484 | 9.25973 | 9.90441 | 2.28873 |
| V2 double sqrt | 8.41558 | 5.02199 | 2.44916 | 1.60365 |
| V2 double div | 8.41483 | 5.03158 | 1.36246 | 1.10354 |
| V2 double combination 1 | 11.2803 | 8.77854 | 6.66632 | 2.53325 |
| V2 double combination 2 | 37.7507 | 7.6074 | 15.1751 | 2.66222 |
| V4 float sqrt | 7.87997 | 4.97866 | 1.56829 | 1.59725 |
| V4 float div | 7.87941 | 4.98164 | 1.22538 | 0.737157 |
| V4 float combination 1 | 9.53807 | 8.28124 | 3.8414 | 2.65078 |
| V4 float combination 2 | 25.5575 | 7.61317 | 11.4465 | 2.66919 |
| V4 double sqrt | 8.4247 | 7.31245 | 2.53925 | 2.55821 |
| V4 double div | 6.06497 | 3.71661 | 2.18469 | 2.19789 |
| V4 double combination 1 | 11.3289 | 8.75939 | 6.69132 | 2.55606 |
| V4 double combination 2 | 37.7666 | 10.5493 | 15.2794 | 4.25895 |
| V8 float sqrt | 8.13279 | 7.31029 | 1.62507 | 2.57191 |
| V8 float div | 6.80213 | 3.74614 | 1.32429 | 0.863688 |
| V8 float combination 1 | 9.25603 | 8.22829 | 3.84619 | 2.70795 |
| V8 float combination 2 | 25.7601 | 10.6259 | 11.436 | 4.32672 |
| | Latency w/o | Latency w/ | Throughput w/o | Throughput w/ |
|---|---|---|---|---|
| Scalar double sqrt | 9.83333 | 6.94472 | 1.58385 | 2.10879 |
| Scalar double div | 8.29314 | 6.69722 | 1.43114 | 1.76826 |
| Scalar double combination 1 | 11.7486 | 11.6426 | 3.80272 | 4.25498 |
| Scalar double combination 2 | 35.428 | 9.25952 | 10.0344 | 3.36071 |
| Scalar float sqrt | 7.0636 | 5.88225 | 1.76003 | 1.70523 |
| Scalar float div | 6.79956 | 5.74564 | 1.03727 | 1.39237 |
| Scalar float combination 1 | 8.85469 | 11.1432 | 2.51212 | 4.03364 |
| Scalar float combination 2 | 23.8257 | 9.06618 | 5.50221 | 2.92968 |
| V2 double sqrt | 8.40408 | 6.99574 | 1.5047 | 2.17567 |
| V2 double div | 8.40506 | 6.99687 | 1.07378 | 1.09575 |
| V2 double combination 1 | 10.3536 | 9.56395 | 4.76477 | 2.25697 |
| V2 double combination 2 | 34.9246 | 10.4645 | 10.033 | 3.09594 |
| V4 float sqrt | 6.90264 | 6.62587 | 0.86425 | 1.73259 |
| V4 float div | 6.90431 | 6.62504 | 0.931655 | 1.27627 |
| V4 float combination 1 | 8.50001 | 8.43789 | 2.65377 | 1.80429 |
| V4 float combination 2 | 21.7596 | 10.3026 | 5.35041 | 2.72136 |
| V4 double sqrt | 8.40729 | 8.70547 | 3.00989 | 3.10562 |
| V4 double div | 7.45602 | 5.82322 | 2.00891 | 1.59794 |
| V4 double combination 1 | 14.1625 | 9.55264 | 9.02738 | 3.01147 |
| V4 double combination 2 | 34.9217 | 12.1894 | 20.0653 | 4.55949 |
| V8 float sqrt | 6.91123 | 8.6802 | 1.58282 | 2.63874 |
| V8 float div | 8.19075 | 5.80181 | 1.30103 | 0.800619 |
| V8 float combination 1 | 8.49971 | 8.43436 | 2.66911 | 1.86872 |
| V8 float combination 2 | 21.7696 | 12.1922 | 5.42351 | 4.32093 |
| | Latency w/o | Latency w/ | Throughput w/o | Throughput w/ |
|---|---|---|---|---|
| Scalar double sqrt | 16.5441 | 14.023 | 5.05278 | 5.83582 |
| Scalar double div | 14.0182 | 11.8617 | 3.73546 | 4.57744 |
| Scalar double combination 1 | 26.9567 | 21.208 | 15.8273 | 9.72597 |
| Scalar double combination 2 | 53.7198 | 16.8929 | 34.5079 | 7.56289 |
| Scalar float sqrt | 16.5306 | 12.7593 | 4.67547 | 5.68942 |
| Scalar float div | 11.5015 | 10.431 | 3.32477 | 4.41451 |
| Scalar float combination 1 | 19.7651 | 19.5915 | 9.72312 | 8.86888 |
| Scalar float combination 2 | 42.7687 | 15.452 | 19.0683 | 6.80609 |
| V2 double sqrt | 18.3472 | 12.1431 | 9.37445 | 6.59464 |
| V2 double div | 18.3512 | 12.142 | 6.49753 | 3.8851 |
| V2 double combination 1 | 36.3182 | 18.345 | 28.7839 | 9.3791 |
| V2 double combination 2 | 79.7839 | 14.7554 | 63.2677 | 8.67484 |
| V4 float sqrt | 14.0528 | 10.1099 | 5.12676 | 6.42582 |
| V4 float div | 14.0522 | 10.1122 | 4.36621 | 3.90253 |
| V4 float combination 1 | 16.9448 | 16.5692 | 11.4089 | 8.74964 |
| V4 float combination 2 | 40.9969 | 14.4236 | 24.6578 | 8.76855 |
On x86 architectures, reciprocal approximation instructions are used for single-precision division when the -ffast-math option is given. Because of this, the single-precision results improve less than the double-precision results.
The Ryzen 9 3900X and Core i5-8400 have fully pipelined and vectorized division and sqrt units, whereas those units are not fully vectorized on the Pentium J5005. Division and sqrt are very fast on the Core i5-8400, yet the proposed plugin still reduces latency substantially. On the other two processors, the plugin clearly reduces latency, and it improves throughput in most cases.
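To read a table row as a speedup factor, divide the time without the plugin by the time with it. The helper below is a small sketch with hypothetical names (`Row`, `speedup`); the numbers are copied from the "Scalar double combination 2" row of the first table above.

```cpp
// One table row; all four values are times in nanoseconds.
struct Row {
    double latency_without;
    double latency_with;
    double throughput_without;
    double throughput_with;
};

// Speedup factor: values above 1 mean the plugin made the function faster,
// values below 1 mean it made the function slower.
double speedup(double without_ns, double with_ns) {
    return without_ns / with_ns;
}

// "Scalar double combination 2", first table on this page.
const Row scalar_double_comb2 = {39.46, 9.25341, 14.995, 2.29033};
```

With this row, `speedup(scalar_double_comb2.latency_without, scalar_double_comb2.latency_with)` is a little over 4, and the corresponding throughput speedup is a little over 6. Rows such as "Scalar float div" in the same table give a throughput factor below 1, which is the kind of exception that the phrase "in most cases" covers.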