SSE/AVX don't have a built-in multiplication of bytes, so some trick is needed. .NET 9 uses a strategy like this, based on widening a vector to twice its width, then using 16-bit multiplication and narrowing the result:
vpmovzxbw ymm0, qword ptr [rdx]
vpmovzxbw ymm1, qword ptr [r8]
vpmullw ymm0, ymm0, ymm1
vpand ymm0, ymm0, ymmword ptr [reloc @RWD00]
vpackuswb ymm0, ymm0, ymm0
vpermq ymm0, ymm0, -40
Which isn't bad, but we can do a little better. The central issue in the code above is that shuffles are expensive and there are 4 of them (vpmovzxbw * 2, vpackuswb, vpermq). On, for example, Intel Skylake or Rocket Lake (or anything in between), those 4 shuffles all need port 5, limiting the throughput to at best one such multiplication per 4 cycles.
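In intrinsics terms, the sequence above corresponds roughly to the following (an assumed reconstruction for illustration, not the actual .NET source; the target attribute is a GCC/Clang mechanism to allow AVX2 intrinsics without global compiler flags):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed reconstruction of the widening strategy: zero-extend each byte to
// 16 bits, multiply, mask off the high bytes, pack back down, and repair the
// lane order that vpackuswb leaves behind.
__attribute__((target("avx2")))
__m128i mullo_epi8_widen(__m128i a, __m128i b)
{
    __m256i wa = _mm256_cvtepu8_epi16(a);                    // vpmovzxbw
    __m256i wb = _mm256_cvtepu8_epi16(b);                    // vpmovzxbw
    __m256i prod = _mm256_mullo_epi16(wa, wb);               // vpmullw
    prod = _mm256_and_si256(prod, _mm256_set1_epi16(0xFF));  // vpand: keep low bytes
    prod = _mm256_packus_epi16(prod, prod);                  // vpackuswb
    prod = _mm256_permute4x64_epi64(prod, 0xD8);             // vpermq, -40 == 0xD8
    return _mm256_castsi256_si128(prod);
}
```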
I suggest two alternative strategies.
Strategy 1
First, an odd/even split using two vpmullw, based on (with modifications) this Q&A on Stack Overflow; in C++ it would look like this:
__m128i mullo_epi8(__m128i a, __m128i b)
{
    // unpack and multiply
    __m128i dst_even = _mm_mullo_epi16(a, b);
    __m128i dst_odd = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    // repack
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8), _mm_and_si128(dst_even, _mm_set1_epi16(0xFF)));
}
(the original answer tries to avoid the _mm_set1_epi16 when AVX2's VPBROADCASTW is unavailable, but it would turn into a vector constant in memory anyway, so that is not important)
By avoiding pmovzxbw, this strategy is also applicable to SSE2 rather than requiring SSE4.1, and it is not a bad strategy for the higher x64 ISA levels either.
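As a quick sanity check, here is how the routine might be used in practice, verified element-wise against a truncating scalar multiply (a hypothetical driver for illustration, not part of the proposal; mullo_epi8 is restated so the example stands alone):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Strategy 1, restated from above so this example is self-contained.
__m128i mullo_epi8(__m128i a, __m128i b)
{
    __m128i dst_even = _mm_mullo_epi16(a, b);
    __m128i dst_odd = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8),
                        _mm_and_si128(dst_even, _mm_set1_epi16(0xFF)));
}

// Hypothetical driver: multiply two byte arrays element-wise, 16 bytes at a
// time, with a scalar loop for the tail.
void mul_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), mullo_epi8(va, vb));
    }
    for (; i < n; i++)
        dst[i] = (uint8_t)(a[i] * b[i]); // same truncating semantics
}
```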
In a situation comparable to the cited "current implementation" (loading a and b from memory, and ignoring for now where the result goes):
vmovdqa xmm0, xmmword ptr [rdi]
vmovdqa xmm1, xmmword ptr [rsi]
vpmullw xmm2, xmm1, xmm0
vpsrlw xmm0, xmm0, 8
vpsrlw xmm1, xmm1, 8
vpmullw xmm0, xmm1, xmm0
vpsllw xmm0, xmm0, 8
vpand xmm1, xmm2, xmmword ptr [rip + .LCPI1_0]
vpor xmm0, xmm1, xmm0
This is more instructions, but they are cheaper instructions. On Skylake or Rocket Lake, UICA estimates a throughput for the sequence above (run in a loop) of once per 2.5 cycles (bottleneck spread evenly across p0 and p1), compared to once per 4 cycles for the original (bottlenecked on p5 by the shuffles).
Strategy 2
Secondly, a strategy based on pmaddubsw (requiring SSSE3). pmaddubsw multiplies unsigned bytes by signed bytes, which is odd but turns out to be harmless: the lowest byte of each product is not affected by the signedness, and the upper byte will be discarded. It also horizontally adds adjacent pairs of products, but that can be made harmless by zeroing one element of every pair. The saturation does not come into play then either: neither 255 * -128 nor 255 * 127 saturates by itself, and the other element of each pair is zero.
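Both "harmless" claims can be checked directly with a scalar model of one pmaddubsw pair in which one element has been zeroed (maddubsw_half_pair is an illustrative name, not a real intrinsic):

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one pmaddubsw 16-bit lane where one element of the pair
// has been zeroed: the result is sat16(u * s), with u an unsigned byte and
// s a signed byte.
int maddubsw_half_pair(uint8_t u, int8_t s)
{
    int sum = (int)u * (int)s;      // the other element contributes u' * 0 == 0
    if (sum > 32767) sum = 32767;   // pmaddubsw's signed saturation
    if (sum < -32768) sum = -32768; // (never reached for a single product)
    return sum;
}
```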
__m128i mullo_epi8(__m128i a, __m128i b)
{
    __m128i dst_even = _mm_maddubs_epi16(a, _mm_and_si128(b, _mm_set1_epi16(0xFF)));
    __m128i dst_odd = _mm_slli_epi16(_mm_maddubs_epi16(a, _mm_andnot_si128(_mm_set1_epi16(0xFF), b)), 8);
    return _mm_or_si128(_mm_and_si128(dst_even, _mm_set1_epi16(0xFF)), dst_odd);
}
Representative code (the andnot was compiled into an and with an inverted mask; for older processors it may be better to reduce the number of loads):
vmovdqa xmm0, xmmword ptr [rdi]
vmovdqa xmm1, xmmword ptr [rsi]
vmovdqa xmm2, xmmword ptr [rip + .LCPI2_0]
vpand xmm3, xmm1, xmm2
vpmaddubsw xmm3, xmm0, xmm3
vpand xmm1, xmm1, xmmword ptr [rip + .LCPI2_1]
vpmaddubsw xmm0, xmm0, xmm1
vpsllw xmm0, xmm0, 8
vpand xmm1, xmm3, xmm2
vpor xmm0, xmm1, xmm0
For this code (when put in a loop), UICA estimates a slightly worse throughput of once per 2.75 cycles on Skylake (due to executing from the uop cache rather than the loop stream buffer; I don't know why that happens here), but a slightly better throughput of once per 2.33 cycles on Rocket Lake.
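For completeness, the pmaddubsw variant can also be checked against a truncating scalar multiply (a standalone sketch; the target attribute is a GCC/Clang mechanism to allow SSSE3 intrinsics without global compiler flags):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>
#include <cstring>

// Strategy 2, restated from above so this test stands alone.
__attribute__((target("ssse3")))
__m128i mullo_epi8_maddubsw(__m128i a, __m128i b)
{
    __m128i even_mask = _mm_set1_epi16(0xFF);
    // even lanes: a_even * b_even, the odd element of each pair is zeroed
    __m128i dst_even = _mm_maddubs_epi16(a, _mm_and_si128(b, even_mask));
    // odd lanes: a_odd * b_odd, shifted up into the high byte of each lane
    __m128i dst_odd = _mm_slli_epi16(
        _mm_maddubs_epi16(a, _mm_andnot_si128(even_mask, b)), 8);
    return _mm_or_si128(_mm_and_si128(dst_even, even_mask), dst_odd);
}
```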
This strategy is essentially what Clang 19 does when autovectorizing a byte multiplication; earlier versions of Clang did it differently (with more shuffles).
I could not evaluate either of these strategies (nor the original) on AMD Zen, since I don't have access to one and UICA does not model it.