Centralize RYU_32_BIT_PLATFORM and optimize shiftright128().#77
Merged
ulfjack merged 1 commit intoulfjack:masterfrom Aug 19, 2018
Merged
Centralize RYU_32_BIT_PLATFORM and optimize shiftright128().#77ulfjack merged 1 commit intoulfjack:masterfrom
ulfjack merged 1 commit intoulfjack:masterfrom
Conversation
* common.h: Centralize RYU_32_BIT_PLATFORM. This indicates whether we should avoid uint64_t operations for performance, and is activated by the MSVC platform macros _M_IX86 and _M_ARM (also defined by Clang on Windows). This centralization will make it easier to detect additional platforms as 32-bit. Note that if a 32-bit platform isn't detected here, the consequences are limited to performance, as Ryu will use correct-but-slower uint64_t operations. * d2s_intrinsics.h: Include common.h for RYU_32_BIT_PLATFORM. * Fix comment, as the shift value can be 49. (Example: `r = 0xba6fa7161a4d6e0c` with `ieeeMantissa = 0x000fa7161a4d6e0c` and `ieeeExponent = 0x000003a6`.) I've experimentally observed that the upper bound might be 57 instead of 58, but I can't prove that theoretically. * Optimize shiftright128() for 32-bit platforms. If we can assume `dist >= 32` (which the optimizer can't see), then we can grab lo's most significant 32 bits with `(uint32_t)(lo >> 32)` (32-bit compilers can do this efficiently), and then perform a 32-bit shift for the remaining `(dist - 32)`. Additionally, `uint64_t | uint32_t` is less work. I observe that this improves Clang 7 RC1 perf by 2.8% (80.382 ns improved to 78.193 ns) which seems nice for such a small, localized change. I also observe that this degrades MSVC perf by 0.6% (106.155 ns degraded to 106.759 ns), but (1) that's very small and (2) Clang is an existence proof that MSVC can do better (tracked by VSO#634761). * Extend `#if defined(_M_IX86)` to `#ifdef RYU_32_BIT_PLATFORM` and update comments accordingly. This extends the division-by-64-bit-constant workaround to ARM32 (for which LLVM#37932 was originally filed). * f2s.c: This already includes common.h, so we can use RYU_32_BIT_PLATFORM with no behavioral changes.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
common.h: Centralize RYU_32_BIT_PLATFORM. This indicates whether we should avoid uint64_t operations for performance, and is activated by the MSVC platform macros _M_IX86 and _M_ARM (also defined by Clang on Windows). This centralization will make it easier to detect additional platforms as 32-bit. Note that if a 32-bit platform isn't detected here, the consequences are limited to performance, as Ryu will use correct-but-slower uint64_t operations.
d2s_intrinsics.h: Include common.h for RYU_32_BIT_PLATFORM.
Fix comment, as the shift value can be 49. (Example:
r = 0xba6fa7161a4d6e0cwithieeeMantissa = 0x000fa7161a4d6e0candieeeExponent = 0x000003a6.) I've experimentally observed that the upper bound might be 57 instead of 58, but I can't prove that theoretically.Optimize shiftright128() for 32-bit platforms. If we can assume
dist >= 32(which the optimizer can't see), then we can grab lo's most significant 32 bits with(uint32_t)(lo >> 32)(32-bit compilers can do this efficiently), and then perform a 32-bit shift for the remaining(dist - 32). Additionally,uint64_t | uint32_tis less work. I observe that this improves Clang 7 RC1 perf by 2.8% (80.382 ns improved to 78.193 ns) which seems nice for such a small, localized change. I also observe that this degrades MSVC perf by 0.6% (106.155 ns degraded to 106.759 ns), but (1) that's very small and (2) Clang is an existence proof that MSVC can do better (tracked by VSO#634761).Extend
#if defined(_M_IX86)to#ifdef RYU_32_BIT_PLATFORMand update comments accordingly. This extends the division-by-64-bit-constant workaround to ARM32 (for which LLVM#37932 was originally filed).f2s.c: This already includes common.h, so we can use RYU_32_BIT_PLATFORM with no behavioral changes.