SSE/AVX don't have a built-in multiplication of bytes, so some trick is needed. .NET 9 uses a strategy like this, based on widening a vector to twice its width, then using 16-bit multiplication and narrowing the result:
vpmovzxbw ymm0, qword ptr [rdx]
vpmovzxbw ymm1, qword ptr [r8]
vpmullw ymm0, ymm0, ymm1
vpand ymm0, ymm0, ymmword ptr [reloc @RWD00]
vpackuswb ymm0, ymm0, ymm0
vpermq ymm0, ymm0, -40
Which isn't bad, but we can do a little better. The central issue in the code above is that shuffles are expensive and there are 4 of them (vpmovzxbw * 2, vpackuswb, vpermq). On, for example, Intel Skylake or Rocket Lake (or anything in between), those 4 shuffles all need port 5, limiting the throughput to at best one such multiplication per 4 cycles.
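In intrinsics terms, the sequence above corresponds roughly to the following (an assumed reconstruction for illustration, not the actual .NET source; the target attribute is a GCC/Clang mechanism to allow AVX2 intrinsics without global compiler flags):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed reconstruction of the widening strategy: zero-extend each byte to
// 16 bits, multiply, mask off the high bytes, pack back down, and repair the
// lane order that vpackuswb leaves behind.
__attribute__((target("avx2")))
__m128i mullo_epi8_widen(__m128i a, __m128i b)
{
    __m256i wa = _mm256_cvtepu8_epi16(a);                    // vpmovzxbw
    __m256i wb = _mm256_cvtepu8_epi16(b);                    // vpmovzxbw
    __m256i prod = _mm256_mullo_epi16(wa, wb);               // vpmullw
    prod = _mm256_and_si256(prod, _mm256_set1_epi16(0xFF));  // vpand: keep low bytes
    prod = _mm256_packus_epi16(prod, prod);                  // vpackuswb
    prod = _mm256_permute4x64_epi64(prod, 0xD8);             // vpermq, -40 == 0xD8
    return _mm256_castsi256_si128(prod);
}
```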
I suggest two alternative strategies.
Strategy 1
First, an odd/even split using two vpmullw, based on (with modifications) this Q&A on Stack Overflow; in C++ it would look like this:
__m128i mullo_epi8(__m128i a, __m128i b)
{
    // unpack and multiply
    __m128i dst_even = _mm_mullo_epi16(a, b);
    __m128i dst_odd = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    // repack
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8), _mm_and_si128(dst_even, _mm_set1_epi16(0xFF)));
}
(the original answer tries to avoid the _mm_set1_epi16 when AVX2's VPBROADCASTW is unavailable, but it would turn into a vector constant in memory anyway, so that is not important)
By avoiding pmovzxbw, this strategy is also applicable to SSE2 rather than requiring SSE4.1, and it is not a bad strategy for the higher x64 ISA levels either.
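As a quick sanity check, here is how the routine might be used in practice, verified element-wise against a truncating scalar multiply (a hypothetical driver for illustration, not part of the proposal; mullo_epi8 is restated so the example stands alone):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Strategy 1, restated from above so this example is self-contained.
__m128i mullo_epi8(__m128i a, __m128i b)
{
    __m128i dst_even = _mm_mullo_epi16(a, b);
    __m128i dst_odd = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8),
                        _mm_and_si128(dst_even, _mm_set1_epi16(0xFF)));
}

// Hypothetical driver: multiply two byte arrays element-wise, 16 bytes at a
// time, with a scalar loop for the tail.
void mul_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), mullo_epi8(va, vb));
    }
    for (; i < n; i++)
        dst[i] = (uint8_t)(a[i] * b[i]); // same truncating semantics
}
```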
In a situation comparable to the cited "current implementation" (loading a and b from memory, and ignoring for now where the result goes):
vmovdqa xmm0, xmmword ptr [rdi]
vmovdqa xmm1, xmmword ptr [rsi]
vpmullw xmm2, xmm1, xmm0
vpsrlw xmm0, xmm0, 8
vpsrlw xmm1, xmm1, 8
vpmullw xmm0, xmm1, xmm0
vpsllw xmm0, xmm0, 8
vpand xmm1, xmm2, xmmword ptr [rip + .LCPI1_0]
vpor xmm0, xmm1, xmm0
This is more instructions, but they are cheaper instructions. On Skylake or Rocket Lake, UICA estimates a throughput for the sequence above (run in a loop) of once per 2.5 cycles (bottleneck spread evenly across p0 and p1), compared to once per 4 cycles for the original (bottlenecked on p5 by the shuffles).
Strategy 2
Secondly, a strategy based on pmaddubsw (requiring SSSE3). pmaddubsw multiplies unsigned bytes by signed bytes, which is odd but turns out to be harmless: the lowest byte of each product is not affected by the signedness, and the upper byte will be discarded. It also horizontally adds adjacent pairs of products, but that can be made harmless by zeroing one element of every pair. The saturation does not come into play then either: neither 255 * -128 nor 255 * 127 saturates by itself, and the other element of each pair is zero.
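Both "harmless" claims can be checked directly with a scalar model of one pmaddubsw pair in which one element has been zeroed (maddubsw_half_pair is an illustrative name, not a real intrinsic):

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one pmaddubsw 16-bit lane where one element of the pair
// has been zeroed: the result is sat16(u * s), with u an unsigned byte and
// s a signed byte.
int maddubsw_half_pair(uint8_t u, int8_t s)
{
    int sum = (int)u * (int)s;      // the other element contributes u' * 0 == 0
    if (sum > 32767) sum = 32767;   // pmaddubsw's signed saturation
    if (sum < -32768) sum = -32768; // (never reached for a single product)
    return sum;
}
```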
__m128i mullo_epi8(__m128i a, __m128i b)
{
    __m128i dst_even = _mm_maddubs_epi16(a, _mm_and_si128(b, _mm_set1_epi16(0xFF)));
    __m128i dst_odd = _mm_slli_epi16(_mm_maddubs_epi16(a, _mm_andnot_si128(_mm_set1_epi16(0xFF), b)), 8);
    return _mm_or_si128(_mm_and_si128(dst_even, _mm_set1_epi16(0xFF)), dst_odd);
}
Representative code (the andnot was compiled into an and with an inverted mask; for older processors it may be better to reduce the number of loads):
vmovdqa xmm0, xmmword ptr [rdi]
vmovdqa xmm1, xmmword ptr [rsi]
vmovdqa xmm2, xmmword ptr [rip + .LCPI2_0]
vpand xmm3, xmm1, xmm2
vpmaddubsw xmm3, xmm0, xmm3
vpand xmm1, xmm1, xmmword ptr [rip + .LCPI2_1]
vpmaddubsw xmm0, xmm0, xmm1
vpsllw xmm0, xmm0, 8
vpand xmm1, xmm3, xmm2
vpor xmm0, xmm1, xmm0
For this code (when put in a loop), UICA estimates a slightly worse throughput of once per 2.75 cycles on Skylake (due to executing from the uop cache rather than the loop stream buffer; I don't know why that happens here), but a slightly better throughput of once per 2.33 cycles on Rocket Lake.
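For completeness, the pmaddubsw variant can also be checked against a truncating scalar multiply (a standalone sketch; the target attribute is a GCC/Clang mechanism to allow SSSE3 intrinsics without global compiler flags):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>
#include <cstring>

// Strategy 2, restated from above so this test stands alone.
__attribute__((target("ssse3")))
__m128i mullo_epi8_maddubsw(__m128i a, __m128i b)
{
    __m128i even_mask = _mm_set1_epi16(0xFF);
    // even lanes: a_even * b_even, the odd element of each pair is zeroed
    __m128i dst_even = _mm_maddubs_epi16(a, _mm_and_si128(b, even_mask));
    // odd lanes: a_odd * b_odd, shifted up into the high byte of each lane
    __m128i dst_odd = _mm_slli_epi16(
        _mm_maddubs_epi16(a, _mm_andnot_si128(even_mask, b)), 8);
    return _mm_or_si128(_mm_and_si128(dst_even, even_mask), dst_odd);
}
```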
This strategy is essentially what Clang 19 does when autovectorizing a byte multiplication; earlier versions of Clang did it differently (with more shuffles).
I could not evaluate either of these strategies (nor the original) on AMD Zen, since I don't have access to one and UICA does not model it.