38217 – Clang/LLVM optimizes division and modulo worse than MSVC, part 2

LLVM Bugzilla is read-only and represents the historical archive of all LLVM issues filled before November 26, 2021. Use github to submit LLVM bugs

Bug 38217 - Clang/LLVM optimizes division and modulo worse than MSVC, part 2

Summary: Clang/LLVM optimizes division and modulo worse than MSVC, part 2

Status:	NEW

Alias:	None

Product:	libraries
Classification:	Unclassified
Component:	Scalar Optimizations (show other bugs)
Version:	trunk
Hardware:	PC Windows NT

Importance:	P normal
Assignee:	Unassigned LLVM Bugs

URL:
Keywords:

Depends on:
Blocks:

Reported:	2018-07-18 19:58 PDT by Stephan T. Lavavej
Modified:	2021-07-07 09:26 PDT (History)
CC List:	6 users (show)

See Also:	51014
Fixed By Commit(s):

Attachments
Test case (801 bytes, text/plain) 2018-07-18 19:58 PDT, Stephan T. Lavavej	Details
Clang codegen for workaround (1.76 KB, text/plain) 2018-07-18 19:59 PDT, Stephan T. Lavavej	Details
Clang codegen for modulo (this is the bug) (1.84 KB, text/plain) 2018-07-18 19:59 PDT, Stephan T. Lavavej	Details
MSVC codegen for workaround (4.15 KB, text/plain) 2018-07-18 19:59 PDT, Stephan T. Lavavej	Details
MSVC codegen for modulo (this is fine) (4.15 KB, text/plain) 2018-07-18 20:00 PDT, Stephan T. Lavavej	Details
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Stephan T. Lavavej 2018-07-18 19:58:10 PDT

Created attachment 20570 [details]
Test case

This appears to be a different bug than https://bugs.llvm.org/show_bug.cgi?id=37983 "Clang/LLVM optimizes division and modulo worse than MSVC" (which is probably a duplicate of https://bugs.llvm.org/show_bug.cgi?id=23106 "Division followed by modulo generates longer machine code than vice versa") because it involves modulo followed by division.

This affects the Ryu algorithm for printing floating-point numbers (https://github.com/ulfjack/ryu ) and therefore affects C++17 floating-point std::to_chars().

I observe that MSVC's codegen is unaffected by WORKAROUND, while Clang/LLVM generates less assembly code (which is faster when profiled in the real algorithm) for WORKAROUND.

Here's a Godbolt link demonstrating the codegen difference (this isn't Windows-specific): https://godbolt.org/g/uX1AD8

C:\Temp\TESTING_X64>cl
Microsoft (R) C/C++ Optimizing Compiler Version 19.16.26504 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

usage: cl [ option... ] filename... [ /link linkoption... ]

C:\Temp\TESTING_X64>clang-cl -m64 -v
clang version 6.0.0 (tags/RELEASE_600/final)
Target: x86_64-pc-windows-msvc
Thread model: posix
InstalledDir: S:\msvc\src\vctools\NonShip\ClangLLVM\bin

C:\Temp\TESTING_X64>type d2s.cpp
#include <stdint.h>
#include <string.h>

static const char DIGIT_TABLE[] =
"0001020304050607080910111213141516171819"
"2021222324252627282930313233343536373839"
"4041424344454647484950515253545556575859"
"6061626364656667686970717273747576777879"
"8081828384858687888990919293949596979899";

void d2s_buffered(uint64_t output, char * result) {
    uint32_t i = 0;

    while (output >= 10000) {
#ifdef WORKAROUND
        const uint32_t c = (uint32_t) (output - 10000 * (output / 10000));
#else
        const uint32_t c = (uint32_t) (output % 10000);
#endif

        output /= 10000;
        const uint32_t c0 = (c % 100) << 1;
        const uint32_t c1 = (c / 100) << 1;
        memcpy(result - i - 1, DIGIT_TABLE + c0, 2);
        memcpy(result - i - 3, DIGIT_TABLE + c1, 2);
        i += 4;
    }
}

C:\Temp\TESTING_X64>cl /EHsc /nologo /W4 /MT /O2 /c d2s.cpp /FAsc /Famsvc_workaround.cod /DWORKAROUND
d2s.cpp

C:\Temp\TESTING_X64>cl /EHsc /nologo /W4 /MT /O2 /c d2s.cpp /FAsc /Famsvc_modulo.cod
d2s.cpp

C:\Temp\TESTING_X64>git diff msvc_workaround.cod msvc_modulo.cod
diff --git a/msvc_workaround.cod b/msvc_modulo.cod
index 1be1419..aff234c 100644
--- a/msvc_workaround.cod
+++ b/msvc_modulo.cod
@@ -86,11 +86,11 @@ $LL2@d2s_buffer:

 ; 15   : #ifdef WORKAROUND
 ; 16   :         const uint32_t c = (uint32_t) (output - 10000 * (output / 10000));
+; 17   : #else^M
+; 18   :         const uint32_t c = (uint32_t) (output % 10000);^M

   00030        48 8b c7         mov     rax, rdi

-; 17   : #else
-; 18   :         const uint32_t c = (uint32_t) (output % 10000);
 ; 19   : #endif
 ; 20   :
 ; 21   :         output /= 10000;

C:\Temp\TESTING_X64>clang-cl -m64 /EHsc /nologo /W4 /MT /O2 /c d2s.cpp /FA /Faclang_workaround.asm /DWORKAROUND

C:\Temp\TESTING_X64>clang-cl -m64 /EHsc /nologo /W4 /MT /O2 /c d2s.cpp /FA /Faclang_modulo.asm

C:\Temp\TESTING_X64>git diff clang_workaround.asm clang_modulo.asm
diff --git a/clang_workaround.asm b/clang_modulo.asm
index 2a8cb49..6026638 100644
--- a/clang_workaround.asm
+++ b/clang_modulo.asm
@@ -29,19 +29,22 @@
        movq    %r9, %rax
        mulq    %r10
        shrq    $11, %rdx
-       imulq   $-10000, %rdx, %rax     # imm = 0xD8F0
-       addq    %r9, %rax
-       movl    %eax, %esi
-       imulq   $1374389535, %rsi, %rsi # imm = 0x51EB851F
-       shrq    $37, %rsi
-       imull   $100, %esi, %edi
-       subl    %edi, %eax
+       imulq   $10000, %rdx, %rax      # imm = 0x2710
+       movq    %r9, %rsi
+       subq    %rax, %rsi
+       imulq   $1374389535, %rsi, %rax # imm = 0x51EB851F
+       movq    %rax, %rdi
+       shrq    $37, %rdi
+       imull   $100, %edi, %edi
+       subl    %edi, %esi
+       shrq    $36, %rax
+       andl    $510, %eax              # imm = 0x1FE
        movl    %ecx, %edi
        movq    %r8, %rbx
        subq    %rdi, %rbx
-       movzwl  (%r11,%rax,2), %eax
-       movw    %ax, -1(%rbx)
-       movzwl  (%r11,%rsi,2), %eax
+       movzwl  (%r11,%rsi,2), %esi
+       movw    %si, -1(%rbx)
+       movzwl  (%rax,%r11), %eax
        movw    %ax, -3(%rbx)
        addl    $4, %ecx
        cmpq    $99999999, %r9          # imm = 0x5F5E0FF

Comment 1 Stephan T. Lavavej 2018-07-18 19:59:09 PDT

Created attachment 20571 [details]
Clang codegen for workaround

Comment 2 Stephan T. Lavavej 2018-07-18 19:59:31 PDT

Created attachment 20572 [details]
Clang codegen for modulo (this is the bug)

Comment 3 Stephan T. Lavavej 2018-07-18 19:59:49 PDT

Created attachment 20573 [details]
MSVC codegen for workaround

Comment 4 Stephan T. Lavavej 2018-07-18 20:00:09 PDT

Created attachment 20574 [details]
MSVC codegen for modulo (this is fine)

Comment 5 Craig Topper 2018-07-18 22:51:40 PDT

There are a couple issues I see here.

We process div by constant and rem by constant using magic multiplies completely independently. At one point in the process we manage to have a shr by 37 as the last step of the div conversion and that is used by the shl 1 from the input source code. We decide that a shr by 36 plus an and with -2 is a better way to do that. The and is later simplied to 510. Then after that we convert the urem and reintroduce the shr by 37, but we don't simplify it. This is how we ended up with a shift by 36 and 37 in the output.

When we expand urem we do the div expansion and then multiply that result by the constant and subtract it from the original value. Due to how the terminating condition of the loop was optimized to use the pre-divided value of 'output', this subtract requires a copy to avoid overwriting the register containing 'output'. The manual workaround avoids this because it got optimized to an add along with a multiply by negative constant. Maybe we can do that explicitly in the urem by constant code? Or add a DAG combine to visitSUB. InstCombine already has such a combine.

Comment 6 Nico Weber 2019-07-26 19:59:02 PDT

+Roman since https://reviews.llvm.org/rL364600 looks kind of similar.

Comment 7 Nico Weber 2019-07-26 20:03:51 PDT

Looks like https://reviews.llvm.org/D65298 might help here.

Comment 8 Roman Lebedev 2019-07-26 20:11:31 PDT

(In reply to Nico Weber from comment #7)
> Looks like https://reviews.llvm.org/D65298 might help here.

That will cause WORKAROUND codegen (or IR) to match WORKAROUND-less codegen.
But as far as i understand the WORKAROUND-less codegen on x86 is worse.
Put other way, D65298 will make things universally bad here.

Comment 9 Nico Weber 2019-07-27 09:15:47 PDT

Progress!

Comment 10 Craig Topper 2019-07-27 17:05:17 PDT

So r350239 made this better in 8.0, but it looks to have gotten longer sometime after that.

Comment 11 Craig Topper 2019-07-31 20:35:23 PDT

Looks like we're now shrinking one of the udiv/urem pairs to i16 in IR. We then turn this into i16 umul_lohi during div legalization. Then turned that into a zero_extend i16->i32, an i32 mul and a shift right by 16. The zero_extend should be unnecessary due to it being connected to urem by 10000 which can only produce 0-9999. So an any_extend should be enough. But we can't recognize that from the expanded urem.