Recognize the Vector128/256.AsVector* methods as intrinsic #1280
tannergooding merged 4 commits into dotnet:master
Conversation
AsVector4 is the simplest case: we have a TYP_SIMD16 and are going to a TYP_SIMD16.
AsVector2 and AsVector3 should also be simple. We have a TYP_SIMD16 and we want a TYP_SIMD8 or TYP_SIMD12. If they are already in register, there is no change needed. If they are in memory, we have 16 bytes that are correctly aligned, so it should likewise be safe to just access them.
> If they are already in register, there is no change needed.
Right, but we should ensure that they get the right class layout.
> If they are already in register, there is no change needed
I'm not so sure. Some SIMD8/12 operations intentionally zero out the upper unused elements (SIMDIntrinsicDiv and maybe others). Either that zeroing is unnecessary or AsVector2/3 should also zero out.
I believe that's just to avoid the penalty around x / 0, though, rather than a functional requirement.
We may want to have both options: a "simple, but correct" API, that is, one which ensures that the upper bits are correctly zeroed,
and a "fast, but unsafe" API, that is, one which leaves clearing/setting the upper bits to the user; that way the conversion cost is as low as possible (generally free).
@CarolEidt, do you have an opinion here?
Should I, when converting Vector128<float> to Vector2/3, explicitly zero the untracked z/w fields or should that responsibility be left up to the consumer if they feel it is important?
As best as I can tell, it doesn't matter from a functional perspective and at worst could cause a minor perf hit if those fields end up containing a subnormal value or NaN.
No, dpps allows you to specify which elements are consumed and which are ignored as part of the computation.
Looks like we might need to update the codegen for Vector2.Dot, however. It currently generates the same code as Vector4.Dot, using 0xF1 as the control word, when it should be 0x3F or 0x31. Vector3.Dot already correctly uses 0x71.
Yeah, just saw that. The SSE2 fallback for Vector2 is also using the Vector4 code, from what I can tell.
Sounds like some room for improvement there then. I'll take a look at a few other cases and will log bugs to start fixing them.
I reverted the change treating AsVector2/AsVector3 as intrinsic for now, since it requires some fixes to the SIMD code and, it looks like, a few fixes elsewhere in the JIT as well.
AsVector128 is a bit more complex.
For Vector4 to Vector128 it should be simple, as it is TYP_SIMD16 to TYP_SIMD16.
The other two (which are currently unhandled) I'd like some input on how to handle.
If they are in register, it can be a nop. Likewise, if they were spilled to the stack, they can be a nop.
However, if it is a field or an array element, we have to read exactly 8 or 12 bytes (I probably need to propose Unsafe versions so we don't have to zero-init the upper elements as well).
Should we carry the NI_Vector128_AsVector128 through to codegen and decide what to do there (nop vs actual read, depending on location) or is there something clever we can do earlier in the compilation?
It would be nice to be able to handle this in a "generic" way - that is, using a CAST from one type to another that does the appropriate widening. However, I don't think the current handling of casts is "generic" in a good way. So, I'd probably vote for carrying the conversion through to Lowering and codegen.
> that is, using a CAST from one type to another that does the appropriate widening.
Such a cast shouldn't be needed, much like we don't need a cast from short to int after a short indir. A SIMD12 indir really produces a SIMD16 value.
I'm going to leave this to a future PR for now. Vector2/3 to Vector128<float> is going to be left as a managed implementation which explicitly zeroes the unused upper elements.
In particular I want to get some more testing and samples done for this and don't want to block the other changes.
AsVector is also fairly simple.
If we don't support AVX, we just go to the software fallback
If we do support AVX, but not AVX2 (meaning Vector&lt;T&gt; is 16 bytes), then we just need GetLower (which is generally a nop, as we are either in register or in a memory location that is correctly aligned and at least 16 bytes).
There was a problem hiding this comment.
AsVector256 (which is Vector<T> to Vector256<T>) is similar.
If we don't support AVX, we just go to the software fallback
If we support AVX2 (meaning Vector&lt;T&gt; is 32 bytes) then it is a nop.
If we only support AVX, we just set the upper 128 bits to zero via ToVector256 (likewise, we probably want an Unsafe API so users can get non-deterministic bits, as we've done for other APIs).
> users can get non-deterministic bits
What does that mean? Who needs non-deterministic bits? This isn't a random number generator.
The intent is to provide an API which provides fast conversion and requires the user to manage the upper bits themselves.
Similarly to how we have both ToVector256 and ToVector256Unsafe: the former explicitly zeroes the upper 128 bits, and the latter is an effective nop that just allows access to the upper 128 bits of the register.
Force-pushed from e0cda97 to 4dc66de
CC. @dotnet/jit-contrib
Force-pushed from f4455fb to 861a68f
Just rebased onto current head. Working on resolving feedback given so far.
The current changes result in the following diff (slimmed down to just the modified tests; there were no changes to other tests or framework/runtime assemblies). The diffs are essentially all cases like the following:

- vmovaps qword ptr [rsp+A0H], xmm6
- xor rax, rax
- mov qword ptr [rsp+80H], rax
- mov qword ptr [rsp+88H], rax
- mov qword ptr [rsp+90H], rax
- mov qword ptr [rsp+98H], rax
+ vmovaps qword ptr [rsp+70H], xmm6
+ vmovaps qword ptr [rsp+60H], xmm7
+ vextractf128 ymm7, ymm6, 1
  call VectorAs__AsVectorUInt64:ValidateResult(Vector`1,Vector128`1,String):this
- vmovupd ymm0, ymmword ptr[rsp+80H]
- vmovupd ymmword ptr[rsp+60H], ymm0
- vmovupd xmm6, xmmword ptr [rsp+60H]
- vmovapd xmmword ptr [rsp+50H], xmm6
- vmovupd ymm0, ymmword ptr[rsp+80H]
- vmovupd ymmword ptr[rsp+20H], ymm0
+ vinsertf128 ymm6, ymm6, ymm7, 1
+ vmovaps ymm0, ymm6
+ vmovapd xmmword ptr [rsp+20H], xmm0
+ vmovupd ymmword ptr[rsp+30H], ymm6
Force-pushed from 7cf5f92 to ff37586
This should be ready for review now. I did not treat the currently problematic cases (Vector2/3 to/from Vector128) as intrinsic. The other cases (Vector to/from Vector128/256 and Vector4 to/from Vector128) are handled as intrinsics.
CarolEidt left a comment
I think it's best to avoid the unreached asserts here, but I see that they're all over the place, even in places where there might be a reasonable default action. I'll submit a separate issue to review this.
{
    // TYP_SIMD8 and TYP_SIMD12 currently only expose "safe" versions
    // which zero the upper elements and so are implemented in managed.
    unreached();
This will expand to a noway_assert I believe. I think it's best to avoid these where there's a valid default action to take, especially in a non-optimizing phase. It will cause the compile to be re-tried with MinOpts. I think the better option for these cases is to have an assert such as assert(!"Unexpected intrinsic in impBaseIntrinsic"); followed by a break, which I believe will cause it to return nullptr as for an unsupported or unrecognized intrinsic.
Seems reasonable. I'm going to go ahead and merge the PR as is and then will follow it up with a PR that adjusts the unreached() calls I can find in the hwintrinsic* files
This updates the Vector128.AsVector* and Vector256.AsVector* methods to be intrinsic so they can be low/zero cost where applicable.