-
Notifications
You must be signed in to change notification settings - Fork 5.3k
[Draft] Accelerate Half with FP16 ISA
#122649
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@tannergooding @jakobbotsch please take a look when you get a chance. |
|
|
||
| #if defined(TARGET_AMD64) && !defined(UNIX_AMD64_ABI) | ||
| assert(!varTypeIsStruct(treeNode)); | ||
| assert(!varTypeIsStruct(treeNode) || treeNode->TypeGet() == TYP_HALF); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| assert(!varTypeIsStruct(treeNode) || treeNode->TypeGet() == TYP_HALF); | |
| assert(!varTypeIsStruct(treeNode) || treeNode->TypeIs(TYP_HALF)); |
How do SIMD types avoid hitting this assert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe its due to on windows the SIMD type would be passed/returned via a buffer, which we have avoided doing with the half type and should have been transformed prior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's my understanding as well.
Notably that is "incorrect" as SIMD types are supposed to be returned in register on Windows and this is a known inconsistency we want to fix long term (#9578)
3b8abaa to
f633726
Compare
|
@dotnet/intel @tannergooding may I get some high level feedback on the structure of the PR? |
src/coreclr/jit/compiler.cpp
Outdated
| if (!compOpportunisticallyDependsOn(InstructionSet_AVX10v1)) | ||
| { | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need this last, not first, otherwise code gets tagged as benefiting from using AVX10v1 unnecessarily
| // kmov instructions reach this path with EA_8BYTE size, even on x86 | ||
| || IsKMOVInstruction(ins) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for removing this part of the assert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think that was an error, will fix.
src/coreclr/jit/emitxarch.cpp
Outdated
|
|
||
| case INS_vmovsh: | ||
| { | ||
| hasSideEffect = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this have a side effect of clearing the upper-bits?
That is, it always does DEST[MAXVL:128] := 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct, I will change.
src/coreclr/jit/emitxarch.cpp
Outdated
|
|
||
| #if defined(TARGET_AMD64) | ||
| case INS_movsxd: | ||
| case INS_vmovsh: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't TARGET_AMD64 exclusive as vmovsh is listed with V/V for support, so is valid for both 64 and 32-bit mode.
| if (IsXMMReg(reg)) | ||
| { | ||
| return emitXMMregName(reg); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be TARGET_AMD64 exclusive either.
src/coreclr/jit/emitxarch.cpp
Outdated
| else if (code & 0xFF000000) | ||
| { | ||
| if (size == EA_2BYTE) | ||
| if (size == EA_2BYTE && (ins != INS_vmovsh && ins != INS_vaddsh)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just use && !IsSimdInstruction(ins)?
src/coreclr/jit/emitxarch.cpp
Outdated
| case INS_movapd: | ||
| case INS_movupd: | ||
| // todo-xarch-half: come back to fix | ||
| case INS_vmovsh: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be grouped with vmovss and vmovsd? While we may not have exact numbers, I'd expect it to have identical perf/latency to those rather than the more general movaps and friends.
src/coreclr/jit/emitxarch.cpp
Outdated
| float insLatency = insLatencyInfos[ins]; | ||
|
|
||
| // todo-xarch-half: hacking an exit on the unhandled ins to make prototyping easier | ||
| if (ins == INS_vcvtss2sh || ins == INS_vcvtsh2ss || ins == INS_vaddsh || ins == INS_vsubsh || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want to put most of these with the v*ss and v*sd equivalents prior to mergine this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and for the above, I will get the proper numbers before putting the PR in non-draft.
src/coreclr/jit/gentree.cpp
Outdated
| // todo-half: this is only to create zero constant half nodes for use in instrincis, anything | ||
| // else will not work |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand this comment.
Presumably we just need a FloatingPointUtils::convertDoubleToHalf(...) method which returns a float16_t type (these were added in C++23, which is newer than our baseline, so we'd just typedef uint16_t float16_t; for the time being).
We then vecCon->gtSimdVal.f16[i] = cnsVal
src/coreclr/jit/gentree.h
Outdated
| { | ||
| if (arg->IsCnsFltOrDbl()) | ||
| { | ||
| simdVal.f16[argIdx] = static_cast<uint16_t>(arg->AsDblCon()->DconValue()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks incorrect as it does a double->uint16_t cast, when we rather need double->float16_t
| } | ||
| } | ||
| else if (node->TypeIs(TYP_VOID)) | ||
| else if (node->TypeIs(TYP_VOID) || node->TypeIs(TYP_INT)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think it was also a bug, I have removed.
| if (sizeBytes < getMinVectorByteLength()) | ||
| { | ||
| *pSimdBaseJitType = simdBaseType; | ||
| // The struct itself is accelerated, in this case, it is `Half`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an assert(sizeBytes == 2) in case we add other sizes in the future?
| break; | ||
| } | ||
|
|
||
| case NI_System_Half_op_Increment: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of these, like Increment/Decrement, could be merged as well using lookupHalfIntrinsic
| if (srcSize == 2) | ||
| return INS_vmovsh; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General convention is to have braces, particularly if it is part of an if/else chain:
| if (srcSize == 2) | |
| return INS_vmovsh; | |
| if (srcSize == 2) | |
| { | |
| return INS_vmovsh; | |
| } |
src/coreclr/jit/lower.cpp
Outdated
| // if (node->TypeGet() == TYP_HALF) | ||
| //{ | ||
| // return false; | ||
| // } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dead code?
src/coreclr/jit/lsrabuild.cpp
Outdated
| case TYP_HALF: | ||
| #ifdef TARGET_X86 | ||
| useCandidates = RBM_FLOATRET; | ||
| #else | ||
| useCandidates = RBM_FLOATRET.GetFloatRegSet(); | ||
| #endif | ||
| break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks to be identical to the TYP_FLOAT path and can be collapsed to share it:
| case TYP_HALF: | |
| #ifdef TARGET_X86 | |
| useCandidates = RBM_FLOATRET; | |
| #else | |
| useCandidates = RBM_FLOATRET.GetFloatRegSet(); | |
| #endif | |
| break; | |
| case TYP_HALF: |
| // We ONLY want the valid double register in the RBM_DOUBLERET mask. | ||
| #ifdef TARGET_AMD64 | ||
| useCandidates = (RBM_DOUBLERET & RBM_ALLDOUBLE).GetFloatRegSet(); | ||
| #else | ||
| useCandidates = (RBM_DOUBLERET & RBM_ALLDOUBLE).GetFloatRegSet(); | ||
| #endif // TARGET_AMD64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not related to this PR, but these two paths are the same
3537f96 to
4235f30
Compare
Draft PR for in-progress work to accelerate
System.Halfwith FP16 ISA.Current work done:
Add a
TYP_HALFto the .NET runtime, which is treated like aTYP_SIMDXX, but with some notable differences. Namely, aTYP_HALFis passed around via the xmm registers, and while it will pass avarTypeIsStructtest, it must be treated as a primitive in other places.Accelerate
System.Halfoperations with theTYP_HALFand some FP16 intrinsics. Not every possible function has been accelerated yet.For discussion:
I have currently worked around some checks to make
TYP_HALFbehave like a struct and a primitive. It's very ad-hoc at the moment.Much of the work to transform the named
System.Halfintrinsics into a sequence of intrinsic nodes is done inimportcall.cppand might want to be moved up into some of thegtNewSimdXXnodes.