Skip to content

Centralize RYU_32_BIT_PLATFORM and optimize shiftright128().#77

Merged
ulfjack merged 1 commit intoulfjack:masterfrom
StephanTLavavej:shiftright128
Aug 19, 2018
Merged

Centralize RYU_32_BIT_PLATFORM and optimize shiftright128().#77
ulfjack merged 1 commit intoulfjack:masterfrom
StephanTLavavej:shiftright128

Conversation

@StephanTLavavej
Copy link
Contributor

  • common.h: Centralize RYU_32_BIT_PLATFORM. This indicates whether we should avoid uint64_t operations for performance, and is activated by the MSVC platform macros _M_IX86 and _M_ARM (also defined by Clang on Windows). This centralization will make it easier to detect additional platforms as 32-bit. Note that if a 32-bit platform isn't detected here, the consequences are limited to performance, as Ryu will use correct-but-slower uint64_t operations.

  • d2s_intrinsics.h: Include common.h for RYU_32_BIT_PLATFORM.

  • Fix comment, as the shift value can be 49. (Example: r = 0xba6fa7161a4d6e0c with ieeeMantissa = 0x000fa7161a4d6e0c and ieeeExponent = 0x000003a6.) I've experimentally observed that the upper bound might be 57 instead of 58, but I can't prove that theoretically.

  • Optimize shiftright128() for 32-bit platforms. If we can assume dist >= 32 (which the optimizer can't see), then we can grab lo's most significant 32 bits with (uint32_t)(lo >> 32) (32-bit compilers can do this efficiently), and then perform a 32-bit shift for the remaining (dist - 32). Additionally, uint64_t | uint32_t is less work. I observe that this improves Clang 7 RC1 perf by 2.8% (80.382 ns improved to 78.193 ns) which seems nice for such a small, localized change. I also observe that this degrades MSVC perf by 0.6% (106.155 ns degraded to 106.759 ns), but (1) that's very small and (2) Clang is an existence proof that MSVC can do better (tracked by VSO#634761).

  • Extend #if defined(_M_IX86) to #ifdef RYU_32_BIT_PLATFORM and update comments accordingly. This extends the division-by-64-bit-constant workaround to ARM32 (for which LLVM#37932 was originally filed).

  • f2s.c: This already includes common.h, so we can use RYU_32_BIT_PLATFORM with no behavioral changes.

* common.h: Centralize RYU_32_BIT_PLATFORM. This indicates whether we
should avoid uint64_t operations for performance, and is activated by
the MSVC platform macros _M_IX86 and _M_ARM (also defined by Clang on
Windows). This centralization will make it easier to detect additional
platforms as 32-bit. Note that if a 32-bit platform isn't detected
here, the consequences are limited to performance, as Ryu will use
correct-but-slower uint64_t operations.

* d2s_intrinsics.h: Include common.h for RYU_32_BIT_PLATFORM.

* Fix comment, as the shift value can be 49. (Example:
`r = 0xba6fa7161a4d6e0c` with `ieeeMantissa = 0x000fa7161a4d6e0c` and
`ieeeExponent = 0x000003a6`.) I've experimentally observed that the
upper bound might be 57 instead of 58, but I can't prove that
theoretically.

* Optimize shiftright128() for 32-bit platforms. If we can assume
`dist >= 32` (which the optimizer can't see), then we can grab lo's
most significant 32 bits with `(uint32_t)(lo >> 32)` (32-bit compilers
can do this efficiently), and then perform a 32-bit shift for the
remaining `(dist - 32)`. Additionally, `uint64_t | uint32_t` is
less work. I observe that this improves Clang 7 RC1 perf by 2.8%
(80.382 ns improved to 78.193 ns) which seems nice for such a small,
localized change. I also observe that this degrades MSVC perf by 0.6%
(106.155 ns degraded to 106.759 ns), but (1) that's very small and
(2) Clang is an existence proof that MSVC can do better (tracked by
VSO#634761).

* Extend `#if defined(_M_IX86)` to `#ifdef RYU_32_BIT_PLATFORM` and
update comments accordingly. This extends the
division-by-64-bit-constant workaround to ARM32 (for which LLVM#37932
was originally filed).

* f2s.c: This already includes common.h, so we can use
RYU_32_BIT_PLATFORM with no behavioral changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants