37932 – Division of uin64_t by constant on ARM32 should multiply by inverse

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 37932 - Division of uin64_t by constant on ARM32 should multiply by inverse

Summary: Division of uin64_t by constant on ARM32 should multiply by inverse

Status:	NEW

Alias:	None

Product:	clang
Classification:	Unclassified
Component:	LLVM Codegen (show other bugs)
Version:	6.0
Hardware:	Other other

Importance:	P enhancement
Assignee:	Unassigned Clang Bugs

URL:	https://godbolt.org/g/c4UdcT
Keywords:

Depends on:
Blocks:

Reported:	2018-06-25 15:37 PDT by Axel Simon
Modified:	2018-08-15 18:30 PDT (History)
CC List:	5 users (show)

See Also:
Fixed By Commit(s):

Attachments
Testcases for division by 5, 10, 100, 100000000 (2.51 KB, text/plain) 2018-08-15 18:27 PDT, Stephan T. Lavavej	Details
Clang 7 RC1 x86 codegen for div.cpp (7.91 KB, text/plain) 2018-08-15 18:28 PDT, Stephan T. Lavavej	Details
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Axel Simon 2018-06-25 15:37:20 PDT

Comment 1 Axel Simon 2018-06-25 15:41:30 PDT

While a 32-bit integer division a/b by a constant is computed as a * (1/b) using the multiply-high technique, a 64-bit division by a constant is not optimized at all on 32-bit platforms. Yet, in

https://gmplib.org/~tege/divcnst-pldi94.pdf

a technique is described to implement such divisions in a cheaper way.

Comment 2 Eli Friedman 2018-06-25 20:53:34 PDT

You can emit a 64x64->128 multiply on a 32-bit platform by expanding it to four 32x32->64 multiplies.  IIRC, someone on my team did an experiment with that on ARM, but we never found a practical case where it was actually profitable, so we didn't upstream the patch.  I can dig it up if you're interested.

Alternatively, I guess you could use the "Dividing udword by uword" technique described in that paper... but in general you'd have to split the division into two divides, and I don't think that comes out cheaper.

Comment 3 Axel Simon 2018-06-26 09:47:51 PDT

I have to admit that my performance numbers are obtained using gcc. But the godbolt link shows that LLVM is also calling out to the same division function that presumably is not much faster.

Regarding an improvement over the generic division function, I followed the process described in Hacker's delight where one multiplies by the inverse and then does a multiplication to compute an error to the actual solution. The book says that the process is manual since it involves finding patterns in the bit sequence of the inverse so that few shift and adds are needed to do the multiplication. I've implemented /3 and /300. We get these numbers on a 300MHz Cortex R5F CPU:

(suffixes: 32 = int32_t, 32u = uint32_t, 64 = int64_t, u64 = uint64_t, a,b,c fixed values that are not small)

a32 = b32 / c32: 0.0233 us/call
a64 = b64 / c64: 1.7500 us/call
a64u = b64u / c64u: 1.6433 us/call
a64 = DivideUint64ByConstant<3>(c64): 0.2100 us/call
a64 = DivideUint64ByConstant<300>(c64): 0.3500 us/call

These numbers are to be taken with a grain of salt due to instruction alignment but otherwise we found our benchmarks representative. So the specialization for specific constants is 10x slower than 32-bit division but nearly 5x faster than generic 64-bit division.

The implementation of /3 is paraphrased form Hacker's delight:

template <int kDivisor>
inline uint64_t DivModUint64ByConstant(uint64_t v, int *remainder);

template <>
inline uint64_t DivModUint64ByConstant<3>(uint64_t v, int *remainder) {
  // The reciprocal of 3 is the irrational binary number 0.0101 0101 ... .
  uint64_t q = (v >> 2) + (v >> 4);  // q = v*0.0101.
  q += q >> 4;                       // q = v*0.0101 0101.
  q += q >> 8;                       // q = v*0.0101 0101 0101 0101.
  q += q >> 16;
  q += q >> 32;
  uint32_t r = v - q * 3;
  const uint32_t k = 11 * r / 32;  // Let the compiler do the shifts.
  if (remainder) *remainder = r - (3 * k);
  return q + k;
}

Comment 4 Eli Friedman 2018-06-26 17:10:22 PDT

Tracked it down the patch I was thinking of; it was actually posted for review, but never got merged for some reason.  See https://reviews.llvm.org/D24822 .

Comment 5 Stephan T. Lavavej 2018-08-15 18:27:45 PDT

Created attachment 20727 [details]
Testcases for division by 5, 10, 100, 100000000

Comment 6 Stephan T. Lavavej 2018-08-15 18:28:31 PDT

Created attachment 20728 [details]
Clang 7 RC1 x86 codegen for div.cpp

Comment 7 Stephan T. Lavavej 2018-08-15 18:30:06 PDT

This dramatically affects the novel Ryu algorithm for converting 64-bit doubles to strings (see https://github.com/ulfjack/ryu/pull/73 ). See the attached div.cpp and div.asm for a minimal repro on x86 (compile with "clang-cl -m32 /EHsc /nologo /W4 /MT /O2 /c /FA div.cpp").