
SSE/AVX byte multiplication could be improved #109775

@IJzerbaard

Description

SSE/AVX have no built-in byte multiplication, so some trick is needed. .NET 9 uses a strategy like the following, based on widening the vectors to twice their width, doing a 16-bit multiplication, and narrowing the result:

   vpmovzxbw ymm0, qword ptr [rdx]
   vpmovzxbw ymm1, qword ptr [r8]
   vpmullw  ymm0, ymm0, ymm1
   vpand    ymm0, ymm0, ymmword ptr [reloc @RWD00]
   vpackuswb ymm0, ymm0, ymm0
   vpermq   ymm0, ymm0, -40

That isn't bad, but we can do a little better. The central issue in the code above is that shuffles are expensive, and there are 4 of them (vpmovzxbw × 2, vpackuswb, vpermq). On, for example, an Intel Skylake or Intel Rocket Lake (or anything in between), those 4 shuffles limit the throughput to at best 1 of these multiplications per 4 cycles.

I suggest two alternative strategies.

Strategy 1

First, an odd/even split with two vpmullw, based on (with modifications) this Q&A on Stack Overflow. In C++ it would look like this:

    __m128i mullo_epi8(__m128i a, __m128i b)
    {
        // unpack and multiply
        __m128i dst_even = _mm_mullo_epi16(a, b);
        __m128i dst_odd = _mm_mullo_epi16(_mm_srli_epi16(a, 8),_mm_srli_epi16(b, 8));
        // repack
        return _mm_or_si128(_mm_slli_epi16(dst_odd, 8), _mm_and_si128(dst_even, _mm_set1_epi16(0xFF)));
    }

(The original answer tries to avoid the _mm_set1_epi16 when AVX2's VPBROADCASTW is unavailable, but it would be compiled into a vector constant anyway, so that is not important.)

By avoiding pmovzxbw, this strategy works on plain SSE2 rather than requiring SSE4.1, and it's not a bad strategy at the higher x64 ISA levels either.
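The odd/even identity is easy to verify exhaustively against scalar byte multiplication; a minimal check (my own harness, not part of the issue) could look like:

```cpp
#include <emmintrin.h>  // SSE2 is enough for this strategy
#include <cstdint>

__m128i mullo_epi8(__m128i a, __m128i b)
{
    __m128i dst_even = _mm_mullo_epi16(a, b);
    __m128i dst_odd  = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8),
                        _mm_and_si128(dst_even, _mm_set1_epi16(0xFF)));
}

// Check every (x, y) byte pair, 16 lanes at a time.
bool check_all_byte_products()
{
    for (int x = 0; x < 256; x++)
        for (int y = 0; y < 256; y += 16) {
            uint8_t a[16], b[16], r[16];
            for (int i = 0; i < 16; i++) { a[i] = (uint8_t)x; b[i] = (uint8_t)(y + i); }
            _mm_storeu_si128((__m128i*)r,
                mullo_epi8(_mm_loadu_si128((const __m128i*)a),
                           _mm_loadu_si128((const __m128i*)b)));
            for (int i = 0; i < 16; i++)
                if (r[i] != (uint8_t)(a[i] * b[i])) return false;
        }
    return true;
}
```

The check works because the cross terms in each 16-bit lane carry a factor of 256, so the low byte of the 16-bit product is exactly the byte product mod 256.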

In a situation comparable to the cited "current implementation" (loading a and b from memory and ignoring, for now, where the result goes), the generated code looks like this:

    vmovdqa xmm0, xmmword ptr [rdi]
    vmovdqa xmm1, xmmword ptr [rsi]
    vpmullw xmm2, xmm1, xmm0
    vpsrlw  xmm0, xmm0, 8
    vpsrlw  xmm1, xmm1, 8
    vpmullw xmm0, xmm1, xmm0
    vpsllw  xmm0, xmm0, 8
    vpand   xmm1, xmm2, xmmword ptr [rip + .LCPI1_0]
    vpor    xmm0, xmm1, xmm0

This is more instructions, but they're cheaper instructions. On Skylake or Rocket Lake, UICA estimates a throughput of performing the above sequence (in a loop) once per 2.5 cycles (the bottleneck spread evenly across p0 and p1), compared to once per 4 cycles for the original (bottlenecked by p5, due to the shuffles).

Strategy 2

Second, a strategy based on pmaddubsw (requiring SSSE3). pmaddubsw multiplies unsigned bytes by signed bytes, which is odd but turns out to be harmless: the lowest byte of the product is not affected by the signedness, and the upper byte will be discarded. It also horizontally adds adjacent pairs, but that can be made harmless by zeroing half of the pairs. The saturation does not come into play either: neither 255 × -128 nor 255 × 127 saturates by itself, and the other element in each pair is zero.
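Both arguments are easy to sanity-check in scalar code (a standalone model of one pmaddubsw lane whose other pair element is zero; names are mine):

```cpp
#include <cstdint>

// Scalar model of one pmaddubsw 16-bit lane where one pair element is zero:
// unsigned byte u times sign-extended byte s, saturated to the int16 range.
int16_t maddubs_lane(uint8_t u, int8_t s)
{
    int32_t sum = (int32_t)u * (int32_t)s;  // the other pair term is zero
    if (sum >  32767) sum =  32767;         // pmaddubsw saturates...
    if (sum < -32768) sum = -32768;
    return (int16_t)sum;
}

bool check_signedness_harmless()
{
    for (int u = 0; u < 256; u++)
        for (int s = 0; s < 256; s++) {
            int32_t raw = u * (int32_t)(int8_t)s;   // unsigned * signed
            // ...but never here: the extremes are 255 * -128 = -32640
            // and 255 * 127 = 32385, both within int16 range.
            if (raw > 32767 || raw < -32768) return false;
            // The low byte matches the unsigned product mod 256.
            uint8_t lane_lo = (uint8_t)maddubs_lane((uint8_t)u, (int8_t)s);
            if (lane_lo != (uint8_t)(u * s)) return false;
        }
    return true;
}
```

Algebraically, sign-extending s only subtracts a multiple of 256 from the product, which cannot change the low byte.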

    __m128i mullo_epi8(__m128i a, __m128i b)
    {
        __m128i dst_even = _mm_maddubs_epi16(a, _mm_and_si128(b, _mm_set1_epi16(0xFF)));
        __m128i dst_odd = _mm_slli_epi16(_mm_maddubs_epi16(a, _mm_andnot_si128(_mm_set1_epi16(0xFF), b)), 8);
        return _mm_or_si128(_mm_and_si128(dst_even, _mm_set1_epi16(0xFF)), dst_odd);
    }

Representative code (the andnot was compiled into an and with an inverted mask; for older processors it may be better to reduce the number of loads):

    vmovdqa xmm0, xmmword ptr [rdi]
    vmovdqa xmm1, xmmword ptr [rsi]
    vmovdqa xmm2, xmmword ptr [rip + .LCPI2_0]
    vpand   xmm3, xmm1, xmm2
    vpmaddubsw      xmm3, xmm0, xmm3
    vpand   xmm1, xmm1, xmmword ptr [rip + .LCPI2_1]
    vpmaddubsw      xmm0, xmm0, xmm1
    vpsllw  xmm0, xmm0, 8
    vpand   xmm1, xmm3, xmm2
    vpor    xmm0, xmm1, xmm0

For this code (when put in a loop), UICA estimates a slightly worse throughput of once per 2.75 cycles on Skylake (because it executes from the uop cache rather than the loop stream buffer; I don't know why that happens here), but a slightly better throughput of once per 2.33 cycles on Rocket Lake.

This strategy is essentially what Clang 19 does when autovectorizing a byte multiplication; earlier versions of Clang did it differently (with more shuffles).
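For reference, the loop Clang autovectorizes is just the obvious scalar one (function name illustrative); compiled at -O2, Clang 19 applies essentially this strategy to it:

```cpp
#include <cstdint>
#include <cstddef>

// A plain byte-multiply loop; Clang 19's autovectorizer turns this into
// essentially the pmaddubsw sequence described above.
void mul_bytes(uint8_t* dst, const uint8_t* a, const uint8_t* b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(a[i] * b[i]);
}
```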

I could not evaluate either of these strategies (nor the original) on AMD Zen since I don't have access to one and UICA does not include it.

Labels: area-CodeGen-coreclr, tenet-performance