Conversation
Tagging subscribers to this area: @dotnet/area-system-io-hashing, @bartonjs, @vcsjones
Pull request overview
This PR refactors Adler-32’s SIMD implementation in System.IO.Hashing to a new strategy-based vectorized core, updates tests to stress delayed-modulo overflow scenarios, and wires the new SIMD source file into the build.
Changes:
- Added a new SIMD implementation (`Adler32Simd.cs`) with AVX2 / SSSE3 / Arm64 (incl. DP) selection and a shared vectorized update core.
- Updated `Adler32` to route vectorized updates through the new implementation and adjusted constant visibility.
- Modified the Adler32 tests to better stress overflow safety and expanded length coverage.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/libraries/System.IO.Hashing/tests/Adler32Tests.cs | Updates the large-input overflow-stress test and expands length coverage; removes a previous all-0xFF reference test. |
| src/libraries/System.IO.Hashing/src/System/IO/Hashing/Adler32Simd.cs | Introduces the new SIMD implementation and strategy abstractions for vectorized Adler32 updates. |
| src/libraries/System.IO.Hashing/src/System/IO/Hashing/Adler32.cs | Simplifies vectorization gating and delegates SIMD updates to the new implementation; exposes ModBase internally. |
| src/libraries/System.IO.Hashing/src/System.IO.Hashing.csproj | Includes the new Adler32Simd.cs in .NETCoreApp builds. |
```csharp
{
    data[i] = (byte)('a' + (i % 26));
}
```
This test (and the other removed test) didn't check the actual boundary condition. For example, if NMax is changed to 8192 in the Adler32 implementation, the tests still pass.
The updated test correctly breaks if NMax is set as small as 5553.
```csharp
    => Vector128.IsHardwareAccelerated && source.Length >= Vector128<byte>.Count;

private static uint UpdateVectorized(uint adler, ReadOnlySpan<byte> source)
    => Adler32Simd.UpdateVectorized(adler, source);
```
Why separate it out like this?
The JIT tends to special case 1 level of inlining differently from 2+ levels of inlining and so simple forwarders like this can hurt things more than help.
The SIMD implementation is all in file-scoped types, so it has to be called from something in this file. I could make those types nested private, but since there are so many, I was trying to keep them entirely local. If you prefer the nested approach, I can easily change it, though I don't foresee any issues with inlining limits here given the core method is intentionally marked NoInlining.
```csharp
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static uint UpdateVectorized(uint adler, ReadOnlySpan<byte> source)
{
    if (Vector256.IsHardwareAccelerated && Avx2.IsSupported)
```
What hardware were you testing AVX512 on? I wouldn't expect it to be slower than Vector256 at all and at worst the same speed here.
I'm on an AMD Zen5 (see full benchmark details in the PR description).
The AVX-512 implementation adds some extra high-latency calculations to the inner loop, so it's expected to be slower. It can't be made to match the perf of AVX2 using the same logic widened, because vpmaddubsw will overflow with the larger multipliers required for the wider inputs, and any adjustment made to prevent overflow breaks the pipelining. The whole thing is very latency-sensitive.
A fast Vector512 implementation could be done with the AVX10.2 unsigned dot product instructions (vpdpbuud), but I didn't bother with that since the hardware still doesn't exist in the wild. I could add it preemptively if you like.
You should be able to treat it as 2x256 instead of as 1x512 to avoid the wider multiplier issue and still get the perf gains.
Yes, that's what this PR does to improve 2x over main. Main uses 1x256 or 1x512. 2x256 is faster than either.
I would expect that actual 2x256 (i.e. effectively unrolling) should be slightly slower than using actual 512 and treating it as 2x256, namely due to the denser code and not needing to manually pipeline the instructions. I would not expect the V512 path to be slower.
This already gets instruction-level parallelism. At best, you could match the perf with V512, but that's a lot of extra complexity for nothing.
```csharp
if (Dp.IsSupported)
{
    return UpdateCore<AdlerVector128, AccumulateArm64, DotProductArm64Dp>(adler, source);
}

return UpdateCore<AdlerVector128, AccumulateArm64, DotProductArm64>(adler, source);
```
What is the perf difference between these two paths? Is it worth the additional complexity here?
The DP implementation is roughly twice as fast as base AdvSimd. Without it, this PR is only about 5% faster than Main on Arm64.
```csharp
    return UpdateCore<AdlerVector128, AccumulateArm64, DotProductArm64>(adler, source);
}

return UpdateCore<AdlerVector128, AccumulateXplat, DotProductXplat>(adler, source);
```
What is the perf difference of the above code paths with the xplat path (all platforms)?
Xplat is around 1/3 the speed of native on both x64 (if restricted to Vector128) and Arm64 (if restricted to AdvSimd base).
```csharp
// This is further optimized to: `high * 16 - high + low`
// and implemented as: `(high << 4) - high + low`.

Vector128<uint> vlo = values & (Vector128<uint>.AllBitsSet >>> 16);
```
Why not `Vector128.Create<uint>(ushort.MaxValue)`?
I actually wrote the code as I wanted it to be interpreted by JIT, i.e. pcmpeqd+psrld, but it gets constant folded and treated as a memory load anyway.
```csharp
{
    Vector128<byte> bytes1 = Vector128.LoadUnsafe(ref sourceRef);
    Vector128<byte> bytes2 = Vector128.LoadUnsafe(ref sourceRef, (uint)Vector128<byte>.Count);
    sourceRef = ref Unsafe.Add(ref sourceRef, (uint)Vector128<byte>.Count * 2);
```
We're looking at ways to reduce or otherwise remove unsafe code like this. While we can't really remove LoadUnsafe, we have found it significantly less error-prone to never update sourceRef and to track the relevant offset indices instead, as it reduces risk of accidental GC holes.
This is the exact same reference math done in the current implementation. What's changed between last week when that was approved and now?
A comment was given then too. Last week's change was mainly taking the existing code "as is" and extending it for the parameterization.
This PR touches the code with a more significant, non-critical rewrite (even with quite a lot of it being the same and just moved down for sharing). Since we're actively working to reduce unsafe usage where feasible, ideally we fix this up rather than continuing to carry the problematic pattern forward.
I think you've mixed up the Adler32 and parameterized CRC32/64 PRs. Vectorized Adler32 was 100% new in #124409.
It's certainly possible to move to a buffer offset, but I think any code using LoadUnsafe requires scrutiny, and it's trivially provable that this code does not have the potential to create a GC hole.
```csharp
public static Vector256<uint> DotProduct(Vector256<uint> addend, Vector256<byte> left, Vector256<byte> right)
{
    Vector256<short> mad = Avx2.MultiplyAddAdjacent(left, right.AsSByte());
    return Avx2.MultiplyAddAdjacent(mad, Vector256<short>.One).AsUInt32() + addend;
```
Shouldn't this second one be HorizontalAddSaturate? The multiply by 1 is unnecessarily adding 2 cycles.
It's a widen + add pairwise. There's actually a single dot product instruction that does it all in AvxVnni (vpdpbusd), but that creates a dependency on the uint accumulator that ends up making it just slightly slower than this form on the two Intel machines and one AMD machine I tried. Though, as mentioned above, I believe the unsigned form of dot product could be used to make a faster Vector512 implementation.
> It's a widen + add pairwise
Right, but the whole setup here is effectively just doing Sum(left * right) (reducing down to multiple 32-bit results, instead of one 8-bit result), which I'm pretty sure can be simplified to less than 5 cycles on anything Skylake or newer
Right, and that's a dot product. The only widening dot product instructions I'm familiar with for x86 are VNNI. What exact instruction sequence are you thinking of?
```csharp
wps += ws1;

ws1 = Accumulate(ws1, bytes1, bytes2);
```
Why do we need to be doing a full reduction every loop iteration here for wps?
It seems like a simple widen + add, with the reduction done only outside the loop, should be plenty sufficient here and would likely provide a bigger perf win.
These are different accumulators that move at different rates. The only easy thing to factor out is the multiplication of the previous sum by the number of bytes that it would be added to each iteration, and that's already done.
```csharp
ws2 = DotProduct(ws2, bytes1, weights1);
ws3 = DotProduct(ws3, bytes2, weights2);
```
Similar here. This is effectively just doing ws2 + Sum(bytes1 * weights1) and ws3 + Sum(bytes2 * weights2)
It isn't clear why the sum at this point is actually needed every inner loop iteration and why it couldn't be hoisted "out" so that it's done only in the outer loop
It could be hoisted out, but then you still have to widen each element before accumulating, which is still expensive. See the first attempt at Arm64 acceleration in Stephen's PR for an idea what that looks like. It's not faster.
Widening is significantly cheaper and more pipelineable (and at least on AVX-512 has single-instruction, single-cycle versions that go to wider registers).
I would expect decent savings if we were only widening and not doing the reductions per inner loop iteration, particularly that would simplify the algorithm and allow better code sharing across all these platforms.
I gave that a quick try again (tried it before a long time ago but didn't keep the result as it wasn't worthwhile). It's still slower.
If you think you can do better than this implementation, be my guest. This was the best perf I could get, on every platform.
We're notably not strictly looking for the "best perf" on every platform.
Rather, we're looking for good enough perf given the complexity, expected payload sizes, and real world impact (not every function is going to be a bottleneck or matter if its taking 200ns vs 400ns).
So part of what's being considered here is whether the extra code complexity, generics, impact to NAOT image size, or other scenarios, etc are meaningful enough compared to just having the simpler and very slightly slower code.
With this being a case where I expect we can remove most of the per-ISA customization and still get "close enough" or even matching performance on most hardware, particularly when not simply looking at the latest Intel or AMD, but rather at the broad range of typical hardware targets, which tend to be a bit older (Haswell, Skylake, Ice Lake, Zen 2, etc.).
> extra code complexity
Although this is more lines of code, I'd argue it's less complex, simply because it's more consistent. The current implementation uses different logic for different vector widths, inconsistent variable names, etc. The new implementation uses the same skeleton for all, with only very small kernels abstracted away per-platform and with names that are easy to follow.
> impact to NAOT image size
It should be noted, this generic approach is an improvement for NAOT code size, because e.g. on x64, instead of dynamically dispatching between up to 3 different implementations depending on ISA support and input size, this will choose exactly 1, which will always be used for any input >= 16 bytes.
In the case of Arm64, it will compile up to 2 potential versions of the core method, but it moves the ISA check out to the dispatcher, avoiding dynamic checks in the inner loop. And if 2x performance isn't good enough to justify a second copy, why are we bothering to implement Vector256 (not to mention the Vector512 implementation I got rid of and which you've argued to bring back)?
This replaces the vectorized Adler32 implementation added in #124409
Major Differences
- Removed the `Vector512` implementation, which was about 20% slower than the `Vector256` implementation on compatible hardware.
- Improved the `Vector256` implementation by taking better advantage of pipelining to compensate for high-latency instructions.
- Defers the modulo reduction for up to `NMax` bytes, speeding large input processing.

In all, this amounts to a roughly 2x perf increase on large inputs, and even more on small inputs that are not an even multiple of the vector size.
Benchmark Summary
x64
Arm64
Detailed Benchmark Results
AMD AVX-512 (Zen 5)
Arm64 (Windows Dev Kit 2023)
Intel AVX2 (Skylake)