Optimize 64-bit division-by-constant for x86 platforms#73
Conversation
The idea is to move some of the basic building blocks of the Ryu algorithm into this file which might then be optimized for different compilers/architectures without decreasing code clarity.
On x86 platforms division-by-constant likely results in calls to library functions like __udivdi3 or __aulldiv, even if the divisor is constant. This patch performs division-by-constant on x86 platforms in the same way as x64 compilers would do, i.e., by a multiplication and shifts. To compute the high 64 bits of a 64x64 product, the current umul128 implementation can be reused. All tested compilers (MSVC, clang and g++) eliminate the instructions used to compute the low 64 bits of the product. Each occurrence of "x/5" or "x%10" has been replaced by a function like "div5(x)" or an expression like "x - 10*div10(x)". This makes the code less readable, but the performance increase is probably worth it. For MSVC x64 and g++ x64 the performance is unaffected. --- Benchmark results for MSVC x86 /O2: benchmark -ryu -64 Before: 64: 165.414 21.922 64: 165.394 19.978 64: 167.034 22.027 64: 164.965 19.600 After: 64: 122.904 13.817 64: 123.616 17.523 64: 123.353 14.597 64: 122.699 13.635 benchmark -ryu -64 -invert Before: 64: 235.964 10.823 64: 235.708 7.874 64: 236.425 11.082 64: 235.988 7.720 After: 64: 182.802 10.710 64: 182.420 11.635 64: 181.156 6.522 64: 181.247 7.254 --- Some more benchmarks using google's benchmark library: Benchmark: Time Before: After: ------------------------------------------------------------------------ // Test random bit patterns. BM_RandomBits_double 226 ns 170 ns 1.33x // Test numbers with x significand digits of the form "d.dddd". BM_Digits_double/_1 725 ns 409 ns 1.77x BM_Digits_double/_2 629 ns 362 ns 1.74x BM_Digits_double/_3 585 ns 331 ns 1.77x BM_Digits_double/_4 557 ns 313 ns 1.78x BM_Digits_double/_5 521 ns 304 ns 1.71x BM_Digits_double/_6 466 ns 289 ns 1.61x BM_Digits_double/_7 431 ns 271 ns 1.59x BM_Digits_double/_8 399 ns 256 ns 1.56x BM_Digits_double/_9 367 ns 239 ns 1.54x BM_Digits_double/_10 363 ns 231 ns 1.57x BM_Digits_double/_11 328 ns 214 ns 1.53x BM_Digits_double/_12 300 ns 198 ns 1.52x BM_Digits_double/_13 269 ns 185 ns 1.45x BM_Digits_double/_14 242 ns 170 ns 1.42x BM_Digits_double/_15 214 ns 155 ns 1.38x // Test uniformly distributed random numbers in the range [10^x, 10^y]. BM_Uniform_double/-30/-25 200 ns 149 ns 1.34x BM_Uniform_double/-25/-20 201 ns 147 ns 1.37x BM_Uniform_double/-20/-15 195 ns 144 ns 1.35x BM_Uniform_double/-15/-20 192 ns 141 ns 1.36x BM_Uniform_double/-10/-5 194 ns 144 ns 1.35x BM_Uniform_double/-5/-1 194 ns 146 ns 1.33x BM_Uniform_double/-1/1 194 ns 145 ns 1.34x BM_Uniform_double/1/5 195 ns 145 ns 1.34x BM_Uniform_double/5/10 192 ns 143 ns 1.34x BM_Uniform_double/10/15 232 ns 162 ns 1.43x BM_Uniform_double/15/20 244 ns 172 ns 1.42x BM_Uniform_double/20/25 235 ns 164 ns 1.43x BM_Uniform_double/25/30 246 ns 171 ns 1.44x BM_Uniform_double/30/35 241 ns 170 ns 1.42x BM_Uniform_double/35/40 211 ns 158 ns 1.34x BM_Uniform_double/40/45 197 ns 146 ns 1.35x BM_Uniform_double/45/50 191 ns 143 ns 1.34x
| // in the same way as 64-bit compilers would do. | ||
| // | ||
| // NB: | ||
| // The multipliers and shift values are the ones generated by clang x64 |
There was a problem hiding this comment.
Do we have an upstream bug that we can cite here?
There was a problem hiding this comment.
For clang I just found this: https://bugs.llvm.org/show_bug.cgi?id=37932
I haven't found something similar for gcc or msvc.
There was a problem hiding this comment.
And for gcc: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37443
There was a problem hiding this comment.
I've filed VSO#664754 "On x86, division by 64-bit constants could be dramatically more efficient" in MSVC's internal database, with your code as a minimal repro. (This is separate from VSO#634761 "C2 emits code that's 34.3% slower than Clang/LLVM for the Ryu algorithm" which is the overall bug for Ryu codegen quality.)
Add links to some bug reports.
|
This PR continues to have great performance on my i7-4790. With my benchmark changes in #74 (defaulting to "inverted mode" for realistic branch prediction), I observe a 45% improvement for MSVC, 48% for Clang, and larger improvements for smaller output (which performs more divisions to trim more digits):
|
On x86 platforms division-by-constant likely results in calls to library
functions like __udivdi3 or __aulldiv, even if the divisor is constant.
This patch performs division-by-constant on x86 platforms in the same
way as x64 compilers would do, i.e., by a multiplication and shifts.
To compute the high 64 bits of a 64x64 product, the current umul128
implementation can be reused. All tested compilers (MSVC, clang and g++)
eliminate the instructions used to compute the low 64 bits of the
product.
Each occurrence of "x/5" or "x%10" has been replaced by a function like
"div5(x)" or an expression like "x - 10*div10(x)". This makes the code
less readable, but the performance increase is probably worth it.
For MSVC x64 and g++ x64 the performance is unaffected.
Benchmark results for MSVC x86 /O2:
Some more benchmarks using google's benchmark library: