Improve Intel hardware intrinsic APIs#17637
Conversation
|
The new function names are horrible and in majority of cases do not follow .NET API coding guidelines by using acronyms. |
|
@CarolEidt @tannergooding @eerhardt @AndyAyersMS @mikedn @jkotas |
| /// PCMPISTRI xmm, xmm/m128, imm8 | ||
| /// int _mm_cmpistrs (__m128i a, __m128i b, const int imm8) | ||
| /// </summary> | ||
| public static bool CompareImplicitLengthNotCAndNotZ(Vector128<sbyte> left, Vector128<sbyte> right, StringComparisonMode mode) => CompareImplicitLengthNotCAndNotZ(left, right, mode); |
There was a problem hiding this comment.
Do we really need the words ImplicitLength and ExplicitLength in the method names? Isn't the fact that I used the overload without specifying the lengths enough to disambiguate that I wanted to use implicit lengths?
There was a problem hiding this comment.
Good point! Will try to eliminate these two words.
| /// PCMPISTRI xmm, xmm/m128, imm8 | ||
| /// int _mm_cmpistrz (__m128i a, __m128i b, const int imm8) | ||
| /// </summary> | ||
| public static bool CompareImplicitLengthNotCAndNotZ(Vector128<byte> left, Vector128<byte> right, StringComparisonMode mode) => CompareImplicitLengthNotCAndNotZ(left, right, mode); |
There was a problem hiding this comment.
What does C and Z mean?
What does O mean? Similiarly, what does the S suffix mean below?
They all represent the corresponding flags in the FLAGS register of x86/x64 CPUs.
don't use single letter abbreviations for prefixes/suffixes in names.
I agree, but now we have some similar APIs that "use single letter abbreviations", such as Sse41.TestC, Avx.TestZ, etc.
Do you have any suggestion to improve the names?
There was a problem hiding this comment.
I assume C == Carry, Z == Zero, O == Overflow, and S == Sign? Can we spell out the words? Or even better, if we could come up a more descriptive name of what the method does, that would be great.
Reading https://software.intel.com/sites/landingpage/IntrinsicsGuide/#!=undefined&text=_mm_cmpestr&expand=814,813,814,813, the only difference between S and Z that I can tell is which argument to check for a null character:
Compare packed strings in a and b with lengths la and lb using the control in imm8, and returns 1 if any character in a was null, and 0 otherwise.
vs.
Compare packed strings in a and b with lengths la and lb using the control in imm8, and returns 1 if any character in b was null, and 0 otherwise.
It seems the name should indicate/describe the differences, instead of using SFlag or ZFlag, and then forcing the caller to look up what SFlag means.
| /// int _mm_cmpistro (__m128i a, __m128i b, const int imm8) | ||
| /// PCMPISTRI xmm, xmm/m128, imm8 | ||
| /// </summary> | ||
| public static bool CompareImplicitLengthO(Vector128<byte> left, Vector128<byte> right, StringComparisonMode mode) => CompareImplicitLengthO(left, right, mode); |
There was a problem hiding this comment.
What does O mean? Similiarly, what does the S suffix mean below? I agree with @4creators that we typically don't use single letter abbreviations for prefixes/suffixes in names.
| /// VPGATHERDD xmm, vm32x, xmm | ||
| /// </summary> | ||
| public static unsafe Vector128<int> GatherVector128(int* baseAddress, Vector128<int> index, byte scale) => GatherVector128(baseAddress, index, scale); | ||
| public static unsafe Vector128<int> GatherVector128WithVector128Int32Indices(int* baseAddress, Vector128<int> index, byte scale) => GatherVector128WithVector128Int32Indices(baseAddress, index, scale); |
There was a problem hiding this comment.
While I can imagine it being easier for JIT implementation, these names are a bit unwieldy from a user's point of view IMO.
There was a problem hiding this comment.
Yes, but if we only rely on overload to distinguish these intrinsics, the current named-intrinsic framework of RyuJIT has to be changed. Meanwhile, in the future, other platforms (e.g., .NET Native, Mono, etc) are also hard to port hardware intrinsic.
JIT experts may have some thoughts? @CarolEidt @mikedn @AndyAyersMS
There was a problem hiding this comment.
The JIT intrinsic plumbing needs to be able to deal with overloads. It does it today in several cases. For example, CORINFO_INTRINSIC_Abs can be either Abs(float) or Abs(double). Or there are multiple intrinsic overloads of Vector::.ctor. If the named intrinsics are meant to be the one true way to identify intrinsics in future, it should be just fixed there.
Meanwhile, in the future, other platforms (e.g., .NET Native, Mono, etc) are also hard to port hardware intrinsic.
I do not think so.
There was a problem hiding this comment.
The current hardware intrinsics distinguish different overloads by "method name + base-type + SIMD size" in the IR. This mechanism works very well on most of the Intel hardware intrinsics, but it is not enough for Avx2.GatherVector*.
If the API shape is more important than the complexity of JIT implementation, I would look into changing the JIT framework or intrinsic IR to address Avx2.GatherVector*.
There was a problem hiding this comment.
It doesn't seem to me that it will be all that difficult to distinguish these, and given that it is not a common case, we could either choose to simply generate different NI* intrinsic names in the importer, based on the type of the relevant argument, or we could postpone that analysis until codegen, and look at the base type of the SIMD argument to determine which instruction to generate.
There was a problem hiding this comment.
And I would say that, in general, the API shape is more important than the complexity of the JIT implementation, as long as it is both "pay for play" (i.e. the complexity is only incurred for the case in question) and the complexity is not very high.
There was a problem hiding this comment.
Thanks, will retain the current APIs and address in JIT.
02608d9 to
92e6ade
Compare
|
Re-improved SSE4.1 intrinsic APIs (e.g., |
| /// INSERTPS xmm, xmm/m32, imm8 | ||
| /// </summary> | ||
| public static Vector128<float> Insert(Vector128<float> value, float data, byte index) => Insert(value, data, index); | ||
| public static Vector128<float> Insert(Vector128<float> value, Vector128<float> data, byte index) => Insert(value, data, index); |
There was a problem hiding this comment.
Corrected the SSE4.1 Insert API on float base type.
Fix #18143
There was a problem hiding this comment.
Should we have both overloads or will we specially handle the Sse.SetScalarVector128 and corresponding index masks to not zero the upper bits of data?
There was a problem hiding this comment.
Also, does the corresponding Avx API need to be updated as well?
There was a problem hiding this comment.
Should we have both overloads
Probably not, SSE4.1 insertps has really special semantics that inserts a value selected from the second vector argument into the first vector. But other SSE4.1 "insert" instructions direct insert a scalar value. https://msdn.microsoft.com/en-us/library/bb514071(v=vs.120).aspx
Also, does the corresponding Avx API need to be updated as well?
No, AVX insert instructions are "normal" :)
|
Addressed the above feedback and fixed SSE4.1 |
| /// PCMPISTRI xmm, xmm/m128, imm8 | ||
| /// int _mm_cmpistrs (__m128i a, __m128i b, const int imm8) | ||
| /// </summary> | ||
| public static bool CompareNotCAndNotZFlag(Vector128<sbyte> left, Vector128<sbyte> right, StringComparisonMode mode) => CompareNotCAndNotZFlag(left, right, mode); |
There was a problem hiding this comment.
In case my original reply gets lost in the outdated section, I've copied it here:
I assume C == Carry, Z == Zero, O == Overflow, and S == Sign? Can we spell out the words? Or even better, if we could come up a more descriptive name of what the method does, that would be great.
Reading https://software.intel.com/sites/landingpage/IntrinsicsGuide/#!=undefined&text=_mm_cmpestr&expand=814,813,814,813, the only difference between S and Z that I can tell is which argument to check for a null character:
Compare packed strings in a and b with lengths la and lb using the control in imm8, and returns 1 if any character in a was null, and 0 otherwise.
vs.
Compare packed strings in a and b with lengths la and lb using the control in imm8, and returns 1 if any character in b was null, and 0 otherwise.
It seems the name should indicate/describe the differences, instead of using SFlag or ZFlag, and then forcing the caller to look up what SFlag means.
There was a problem hiding this comment.
I assume C == Carry, Z == Zero, O == Overflow, and S == Sign? Can we spell out the words?
@eerhardt Right, and I am OK to spell them out if it matches the C# naming convention better.
Or even better, if we could come up a more descriptive name of what the method does, that would be great.
I do not think so. "A more descriptive name" may be too long and not clear. For example, NotCAndNotZFlag looks even clearer than "absolute value of lb is larger than or equal to MaxSize and the resulting mask is equal to zero". Meanwhile, we already assume that the users of HW intrinsics have been familar/mastering with the C++ counterparts or ISA design. So, I still suggest to name these intrinsics with flags.
There was a problem hiding this comment.
I still think we can come up with better names. It appears there are 5 cases we need to name:
| Flag | Description | Possible Names |
|---|---|---|
| S & Z |
Compare packed strings with implicit lengths in a and b using the control in imm8, and returns 1 if any character in a was null, and 0 otherwise. & Compare packed strings with implicit lengths in a and b using the control in imm8, and returns 1 if any character in b was null, and 0 otherwise. |
CompareLeftContainsNull, CompareReturnLeftContainsNull, CompareReturnPartialLeft, CompareGetIsPartialLeft, CompareIsLeftPartial (Same for Right) |
| O | Compare packed strings with implicit lengths in a and b using the control in imm8, and returns bit 0 of the resulting bit mask. | CompareReturnFirstBit, CompareGetFirstBit |
| C | Compare packed strings with implicit lengths in a and b using the control in imm8, and returns 1 if the resulting mask was non-zero, and 0 otherwise. | CompareIsResultNonZero, CompareHasNonZeroResult, CompareNonZeroMask, CompareHasResult |
| A | Compare packed strings with implicit lengths in a and b using the control in imm8, and returns 1 if b did not contain a null character and the resulting mask was zero, and 0 otherwise. |
A is definitely hard to describe succinctly, since it is the combination of two other ones. So coming up with a good name for it will probably be based on the two underlying names.
Note: The IntrinsicsGuide's description for the explicit length S & Z overloads (_mm_cmpestrs & _mm_cmpestrz) appears to be incorrect.
Compare packed strings in a and b with lengths la and lb using the control in imm8, and returns 1 if any character in b was null, and 0 otherwise.
But then the Operation says:
size := (imm8[0] ? 16 : 8) // 8 or 16-bit characters
UpperBound := (128 / size) - 1
dst := (lb <= UpperBound)
Where the Operation appears to match what is in the underlying instruction manuals.
In light of the implicit vs explicit differences, to name S & Z, that's why I like something like Is Partial because it describes at a higher-level what is being checked. ex. File.Exists instead of File.StatReturnsZero.
There was a problem hiding this comment.
@eerhardt Thank you so much for the investigation. I read the manual again and I agree that "a more descriptive name" is very necessary because FLAGS is used in a non-standard manner with these SSE4.2 string instructions.
According to the above table you provided, there is a potential name matrix:
| original names | new names |
|---|---|
| CompareCFlag | CompareNonZeroResultMask |
| CompareZFlag | CompareRightContainsNull |
| CompareSFlag | CompareLeftContainsNull |
| CompareOFlag | CompareReturnFirstResultBit |
| CompareNotCAndNotZFlag | CompareZeroResultMaskAndRightNotContainsNull |
Does it look good to you?
There was a problem hiding this comment.
The only issue I see if the difference between implicit and explicit length overloads. If we called the explicit length overloads ContainsNull that would be incorrect. It instead returns if the rightLength is less than the maximum (8 or 16). So we should use a different name for the explicit length overloads.
I'd like to keep them using the same name, if a decent name comes up. But if one isn't available, I'm fine with the implicit and explicit length overloads being named different (RightContainsNull vs. RightLessThanMax/RightIsPartial/etc.)
There was a problem hiding this comment.
I think we could make it more obvious by calling the Polarity flags something like NegateResult instead.
Reading if (ContainsMatch(op1, op2, StringComparisonMode.NegateResult)) seems much better (both readable and understandable) than if (CompareZeroResultMask(op1, op2, StringComparisonMode.NegatiePolarity))
There was a problem hiding this comment.
We could do EndOfString, since that applies to both Explicit (where end of string is length determined) and Implicit (where end of string is determined by the null character)?
I think this is a good idea and I also perfer to use the same name for Explicit and Implicit.
How about Left/RightTerminated?
There was a problem hiding this comment.
I think we could make it more obvious by calling the Polarity flags something like NegateResult instead.
Good point! We can rename these two polarity flags both:
/// <summary>
/// _SIDD_NEGATIVE_POLARITY
/// </summary>
NegativeResult = 0x10,
/// <summary>
/// _SIDD_MASKED_NEGATIVE_POLARITY
/// </summary>
NegativeUsefulResult = 0x30,There was a problem hiding this comment.
With the above renamed StringComparisonMode, we seems to have the new name matrix:
| original names | new names |
|---|---|
| CompareCFlag | CompareHasMatch |
| CompareZFlag | CompareRightTerminated |
| CompareSFlag | CompareLeftTerminated |
| CompareOFlag | CompareReturnFirstResultBit |
| CompareNotCAndNotZFlag | CompareNoMatchAndRightNotTerminated |
|
It might be good to split this into two PRs. One fixing the argument types and the other fixing the SSE4.2 API names. Did we also already have the discussion on whether it was better for mask/control parameters to be the same as the base type or if it was better to be the same-sized integer type? CC. @eerhardt for input on the API side of things. |
I am not sure. I put them together just for simplifying the CoreFX/CoreCLR synchronization. |
Right, but it sounds like the Sse4.2 changes are a bit more complex, and should probably have a bit more "in-depth" review. I'm fine with keeping them together, if you prefer, but it might be easier to do two separate changes. |
@tannergooding Thanks for clarifying. I agree that Sse4.2 changes are a bit more complex and it may impact other API naming. For example, if we decide not using "flags" (e.g., I would split these changes if Sse41 blocks other work. |
| /// PCMPISTRM xmm, xmm/m128, imm8 | ||
| /// </summary> | ||
| public static Vector128<ushort> CompareImplicitLengthBitMask(Vector128<sbyte> left, Vector128<sbyte> right, StringComparisonMode mode) { throw new PlatformNotSupportedException(); } | ||
| public static Vector128<ushort> CompareBitMask(Vector128<sbyte> left, Vector128<sbyte> right, StringComparisonMode mode) { throw new PlatformNotSupportedException(); } |
There was a problem hiding this comment.
For my information - Why did we break out BitMask vs UnitMask into different overloads? Why not keep it in the StringComparisonMode enum?
There was a problem hiding this comment.
Ah, good catch! This is a legacy design from the previous versions, will fix. Thanks!
CarolEidt
left a comment
There was a problem hiding this comment.
For my information - Why did we break out BitMask vs UnitMask into different overloads? Why not keep it in the StringComparisonMode enum?
To me, it seems wrong to have it in the StringComparisonMode enum, since the values overlap.
| /// <summary> | ||
| /// _SIDD_UNIT_MASK | ||
| /// </summary> | ||
| UnitMask = 0x40, |
There was a problem hiding this comment.
How are UnitMask and MostSignificant (or LeastSignificant and BitMask) distinguished? I don't think we should have multiple names for the same value in a single enum - are these used in different contexts? If so, I would think they'd be different enums.
There was a problem hiding this comment.
are these used in different contexts?
Yes.
UnitMask and BitMask is only used for CompareMask (PCMPESTRM).
LeastSignificant and MostSignificant is only used for CompareIndex (PCMPESTRI).
There was a problem hiding this comment.
If so, I would think they'd be different enums.
So, CompareMask and CompareIndex would have an additional imm parameter, which would complicate the JIT implementation (especially non-const fallback).
How about encoding these two flags into the intrinsic function name?
There was a problem hiding this comment.
How about encoding these two flags into the intrinsic function name?
I don't think that's necessary - I just don't see why the enums should be the same.
There was a problem hiding this comment.
I think it depends on whether we want to map closely with the underlying hardware instruction, which I thought was a desire.
Section 4.1.5 Output Selection in https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-2b-manual.pdf
Since it will be a Flags enum (after this PR), I don't think this is a huge issue. It follows the underlying HW instruction design, and is understandable - use Least|Most Significant with the CompareIndex method, use Bit|Unit Mask with the CompareMask method.
There was a problem hiding this comment.
Rethinking the open a new issue comment, I guess the current code has the Unit/BitMask methods split, so this PR is already changing this API.
I dislike exploding the methods out because:
- It increases the number of methods, and method names in intellisense/docs.
- It doesn't allow for a "default". I either need to pick - Bit or Unit mask - up front. Using an enum allows me to use the API without making the choice. And then if that choice doesn't work for me, I can change the enum value.
I do like the advantage that 3 enums provides in that Bit/UnitMask and Least/MostSignificant values don't affect the methods that return bool, so those methods don't even get those options. And also that CompareIndex doesn't get Bit/UnitMask option and vice versa. So if we were going to change the current PR, that would be my vote.
There was a problem hiding this comment.
Using an enum allows me to use the API without making the choice.
I'm not sure how valid this is. The choice can make a fairly big difference on how you write your algorithm, so you will generally need to make the choice and be aware of how it impacts your code, right out.
(You will also have to specify one of the two enum values either way)
There was a problem hiding this comment.
I'm not sure how valid this is.
It's completely valid when I'm exploring the API to learn how it works.
(You will also have to specify one of the two enum values either way)
Only in the 3 enums design. As currently proposed I wouldn't need to specify Bit/UnitMask when calling CompareMask.
There was a problem hiding this comment.
As currently proposed I wouldn't need to specify Bit/UnitMask when calling CompareMask.
I'm not convinced that's a good thing 😄
There was a problem hiding this comment.
That is, these APIs and instructions are already fairly hard to understand. Making parts of it more obvious, even if it results in us exposing double the methods might be a good thing (It looks like we have ~50 methods exposed right now).
|
At this point, I would propose changing this PR so that it doesn't change the enum names or usage (i.e. only do changes 1 and 3. |
Ok, let me separate SSE4.2 changes into a new PR. |
| /// PEXTRB reg/m8, xmm, imm8 | ||
| /// </summary> | ||
| public static sbyte Extract(Vector128<sbyte> value, byte index) => Extract(value, index); | ||
| public static int Extract(Vector128<sbyte> value, byte index) => Extract(value, index); |
There was a problem hiding this comment.
Also fixes SSE2/SSE4.1/AVX Extract return type to reflect the underlying instruction behavior https://github.com/dotnet/coreclr/issues/17957
tannergooding
left a comment
There was a problem hiding this comment.
LGTM. Couple minor questions.
e7e8f0c to
0157aab
Compare
| Store(buffer, value); | ||
| return buffer[index]; | ||
| } | ||
| return Unsafe.Add<byte>(ref Unsafe.As<Vector256<byte>, byte>(ref value), index & 0x1F); |
There was a problem hiding this comment.
Removed sbyte and short overloads of SSE2/SSE4.1/AVX Extract and simplify Avx.Extract non-const fallback as @jkotas's suggestion.
I would prepare the CoreFX counterpart if this PR looks good to you guys.
|
@eerhardt Can we merge this PR? |
|
Oops - I thought this was already merged. Merging now. |
* Improve Intel hardware intrinsic APIs * Simplify Avx.Extract non-const fallback Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>
* Improve Intel hardware intrinsic APIs * Simplify Avx.Extract non-const fallback Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
* Improve Intel hardware intrinsic APIs * Simplify Avx.Extract non-const fallback Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>
* Improve Intel hardware intrinsic APIs * Simplify Avx.Extract non-const fallback Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
This will disable the test from being run with smart in our windows arm testing. This corresponds to the tests being deleted in dotnet#17637.
This will disable the test from being run with smart in our windows arm testing. This corresponds to the tests being deleted in #17637.
* Improve Intel hardware intrinsic APIs * Simplify Avx.Extract non-const fallback Commit migrated from dotnet/coreclr@ea58e86
) This will disable the test from being run with smart in our windows arm testing. This corresponds to the tests being deleted in dotnet/coreclr#17637. Commit migrated from dotnet/coreclr@b55d2b8

This PR
encodes the result flags in the names of certain SSE4.2 string processing intrinsic. That provides more stable runtime behaviors and simplifies JIT implementation, discussed in https://github.com/dotnet/coreclr/issues/16270InsertAPI onfloatbase type, closes https://github.com/dotnet/coreclr/issues/18143.Extractreturn type to reflect the underlying instruction behavior, closes https://github.com/dotnet/coreclr/issues/17957@CarolEidt @tannergooding @eerhardt