Benchmark results

Below are benchmark results on a few processors.

Sqrt, div, combination 1, and combination 2 correspond to the following four functions. The argument types, the sqrt function, and the return type are changed to match the indicated type. See tester/microbench.cpp for the source code.

bool func0(double w, double x, double y, double z) { // sqrt
  return w * sqrt(x) + y > z;
}

bool func1(double a, double b, double c, double d) { // div
  return a / b + c > d;
}

bool func2(double a, double b, double c, double d) { // combination 1
  return a / b + sqrt(c + 1.1) < d / (a + 1.2) + (b + 1.3) / c;
}

bool func3(double a0, double a1, double a2, double a3) { // combination 2
  return sqrt(a3 / sqrt(a2 / sqrt(a1 / sqrt(1.1 / a0)))) < 1.1;
}

The microbenchmark code is built with the following options. The first command builds without the plugin and the second loads it. Compilation is carried out on the target computer.

clang-10 -O3 -march=native -ffast-math -S example.c
clang-10 -Xclang -load -Xclang libMathPeephole.so -O3 -march=native -ffast-math -S example.c

Latency w/o and throughput w/o are the results without using the transform plugin. Latency w/ and throughput w/ are the results with the plugin.

Latency is the time, in nanoseconds, that a function needs to finish calculating its return value.

Throughput is the reciprocal throughput of each function, in nanoseconds: the function can be executed once per the indicated time period.
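The difference between the two metrics can be illustrated with a minimal timing sketch. This is not the code in tester/microbench.cpp. The helper names (latency_ns, throughput_ns, elapsed_ns, g_sink), the input values, and the feedback scheme are assumptions made for illustration only. The latency loop makes each call depend on the previous result, while the throughput loop issues independent calls that the processor is free to overlap.

```cpp
#include <cassert>
#include <chrono>

// func1 ("div") as listed on this page.
bool func1(double a, double b, double c, double d) { return a / b + c > d; }

// Inputs are read through volatile so the compiler cannot constant-fold the calls.
static volatile double g_b = 3.0, g_c = 0.25, g_d = 0.5;
static volatile int g_sink = 0;  // keeps the results observable

using bench_clock = std::chrono::steady_clock;

static double elapsed_ns(bench_clock::time_point t0, bench_clock::time_point t1) {
  return std::chrono::duration<double, std::nano>(t1 - t0).count();
}

// Latency: the first argument of each call depends on the previous result,
// so consecutive calls cannot overlap. The feedback step itself adds a
// little overhead to the measured time.
double latency_ns(int n) {
  assert(n > 0);
  const double b = g_b, c = g_c, d = g_d;
  double a = 1.5;
  int trues = 0;
  auto t0 = bench_clock::now();
  for (int i = 0; i < n; i++) {
    bool r = func1(a, b, c, d);
    trues += r;
    a = r ? a * 0.5 : a * 2.0;  // feed the result into the next input
  }
  auto t1 = bench_clock::now();
  g_sink = trues;
  return elapsed_ns(t0, t1) / n;
}

// Reciprocal throughput: the calls are independent of each other,
// so the processor may pipeline or overlap them.
double throughput_ns(int n) {
  assert(n > 0);
  const double b = g_b, c = g_c, d = g_d;
  double in[64];
  for (int i = 0; i < 64; i++) in[i] = b * (0.01 * i);
  int trues = 0;
  auto t0 = bench_clock::now();
  for (int i = 0; i < n; i++) trues += func1(in[i & 63], b, c, d);
  auto t1 = bench_clock::now();
  g_sink = trues;
  return elapsed_ns(t0, t1) / n;
}
```

Calling latency_ns(n) and throughput_ns(n) with a large n gives per-call times in nanoseconds. A real harness would also need warm-up runs, repeated measurements, and a check that the compiler has not vectorized or removed the loops in a way that changes what is being timed.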

One execution of a SIMD version of a function returns multiple values at a time.
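Because a SIMD call produces several results, it can help to normalize a per-call time by the number of lanes when comparing it with a scalar row. The helper below is a hypothetical illustration (per_element_ns is not part of the benchmark code), and it assumes that V2, V4, and V8 denote 2, 4, and 8 lanes.

```cpp
#include <cassert>

// Hypothetical helper: converts a per-call time (ns) of a SIMD function
// into a per-element time by dividing by the number of lanes.
double per_element_ns(double per_call_ns, int lanes) {
  assert(lanes > 0);
  return per_call_ns / lanes;
}
```

For instance, per_element_ns(2.19789, 4) normalizes the "V4 double div" throughput with the plugin on the Ryzen 9 3900X to a per-element figure.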

AMD Ryzen 9 3900X (on VirtualBox)

Smaller values are better.

	Latency w/o	Latency w/	Throughput w/o	Throughput w/
Scalar double sqrt	10.8974	7.07204	2.37942	1.63928
Scalar double div	8.16066	6.25515	1.36006	1.29232
Scalar double combination 1	12.7868	11.1754	5.1786	2.92174
Scalar double combination 2	39.46	9.25341	14.995	2.29033
Scalar float sqrt	10.328	7.06584	1.54236	1.7094
Scalar float div	7.34748	6.25499	0.95567	1.207
Scalar float combination 1	12.1938	11.1764	2.97841	2.92405
Scalar float combination 2	27.4484	9.25973	9.90441	2.28873
V2 double sqrt	8.41558	5.02199	2.44916	1.60365
V2 double div	8.41483	5.03158	1.36246	1.10354
V2 double combination 1	11.2803	8.77854	6.66632	2.53325
V2 double combination 2	37.7507	7.6074	15.1751	2.66222
V4 float sqrt	7.87997	4.97866	1.56829	1.59725
V4 float div	7.87941	4.98164	1.22538	0.737157
V4 float combination 1	9.53807	8.28124	3.8414	2.65078
V4 float combination 2	25.5575	7.61317	11.4465	2.66919
V4 double sqrt	8.4247	7.31245	2.53925	2.55821
V4 double div	6.06497	3.71661	2.18469	2.19789
V4 double combination 1	11.3289	8.75939	6.69132	2.55606
V4 double combination 2	37.7666	10.5493	15.2794	4.25895
V8 float sqrt	8.13279	7.31029	1.62507	2.57191
V8 float div	6.80213	3.74614	1.32429	0.863688
V8 float combination 1	9.25603	8.22829	3.84619	2.70795
V8 float combination 2	25.7601	10.6259	11.436	4.32672

Intel Core i5-8400

	Latency w/o	Latency w/	Throughput w/o	Throughput w/
Scalar double sqrt	9.83333	6.94472	1.58385	2.10879
Scalar double div	8.29314	6.69722	1.43114	1.76826
Scalar double combination 1	11.7486	11.6426	3.80272	4.25498
Scalar double combination 2	35.428	9.25952	10.0344	3.36071
Scalar float sqrt	7.0636	5.88225	1.76003	1.70523
Scalar float div	6.79956	5.74564	1.03727	1.39237
Scalar float combination 1	8.85469	11.1432	2.51212	4.03364
Scalar float combination 2	23.8257	9.06618	5.50221	2.92968
V2 double sqrt	8.40408	6.99574	1.5047	2.17567
V2 double div	8.40506	6.99687	1.07378	1.09575
V2 double combination 1	10.3536	9.56395	4.76477	2.25697
V2 double combination 2	34.9246	10.4645	10.033	3.09594
V4 float sqrt	6.90264	6.62587	0.86425	1.73259
V4 float div	6.90431	6.62504	0.931655	1.27627
V4 float combination 1	8.50001	8.43789	2.65377	1.80429
V4 float combination 2	21.7596	10.3026	5.35041	2.72136
V4 double sqrt	8.40729	8.70547	3.00989	3.10562
V4 double div	7.45602	5.82322	2.00891	1.59794
V4 double combination 1	14.1625	9.55264	9.02738	3.01147
V4 double combination 2	34.9217	12.1894	20.0653	4.55949
V8 float sqrt	6.91123	8.6802	1.58282	2.63874
V8 float div	8.19075	5.80181	1.30103	0.800619
V8 float combination 1	8.49971	8.43436	2.66911	1.86872
V8 float combination 2	21.7696	12.1922	5.42351	4.32093

Intel Pentium J5005

	Latency w/o	Latency w/	Throughput w/o	Throughput w/
Scalar double sqrt	16.5441	14.023	5.05278	5.83582
Scalar double div	14.0182	11.8617	3.73546	4.57744
Scalar double combination 1	26.9567	21.208	15.8273	9.72597
Scalar double combination 2	53.7198	16.8929	34.5079	7.56289
Scalar float sqrt	16.5306	12.7593	4.67547	5.68942
Scalar float div	11.5015	10.431	3.32477	4.41451
Scalar float combination 1	19.7651	19.5915	9.72312	8.86888
Scalar float combination 2	42.7687	15.452	19.0683	6.80609
V2 double sqrt	18.3472	12.1431	9.37445	6.59464
V2 double div	18.3512	12.142	6.49753	3.8851
V2 double combination 1	36.3182	18.345	28.7839	9.3791
V2 double combination 2	79.7839	14.7554	63.2677	8.67484
V4 float sqrt	14.0528	10.1099	5.12676	6.42582
V4 float div	14.0522	10.1122	4.36621	3.90253
V4 float combination 1	16.9448	16.5692	11.4089	8.74964
V4 float combination 2	40.9969	14.4236	24.6578	8.76855

Discussion

On x86 architectures, reciprocal approximation instructions are used for single-precision division when the -ffast-math option is specified. Because of this, single-precision results do not improve as much as double-precision results.
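The idea behind division through a reciprocal approximation can be sketched in portable C++. This is an illustration only, not the code a compiler emits. coarse_rcp is a hypothetical stand-in for a hardware reciprocal estimate: it computes 1/b and then deliberately discards the low mantissa bits. One Newton-Raphson step, x1 = x0 * (2 - b * x0), then roughly doubles the number of correct bits. The names coarse_rcp, refine_rcp, approx_div, and rel_error are all assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a hardware reciprocal estimate: compute 1/b,
// then clear the low 12 mantissa bits so only a coarse estimate remains.
float coarse_rcp(float b) {
  assert(b != 0.0f);
  float r = 1.0f / b;
  std::uint32_t bits;
  std::memcpy(&bits, &r, sizeof bits);
  bits &= 0xFFFFF000u;  // keep sign, exponent, and the top 11 mantissa bits
  std::memcpy(&r, &bits, sizeof r);
  return r;
}

// One Newton-Raphson step for the reciprocal of b, starting from estimate x.
float refine_rcp(float b, float x) { return x * (2.0f - b * x); }

// a / b computed as a * (refined reciprocal of b), with no divide in the
// refinement step itself.
float approx_div(float a, float b) { return a * refine_rcp(b, coarse_rcp(b)); }

// Relative error of an approximation, for checking the result.
double rel_error(double approx, double exact) {
  return std::fabs(approx - exact) / std::fabs(exact);
}
```

Comparing rel_error for coarse_rcp(3.0f) and for approx_div(1.0f, 3.0f) against 1.0 / 3.0 shows how much the single refinement step recovers. On real hardware the estimate comes from a dedicated instruction rather than from a division.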

Ryzen 9 3900X and Core i5-8400 have fully pipelined and vectorized division and sqrt units, while those units are not fully vectorized on Pentium J5005. Division and sqrt are already very fast on Core i5-8400, yet the plugin still improves latency in most cases there, for example from 35.428 ns to 9.25952 ns for scalar double combination 2. On the other two processors, the plugin clearly improves latency, and it improves throughput in most cases.
