Description
The majority of hardware intrinsic arithmetic operations support using a memory address as one of their operands. This allows writing more efficient code that sidesteps memory bottlenecks. Unfortunately, the JIT does not fold memory loads into an arithmetic operation's operand and instead generates code with separate loads or stores.
The following example illustrates the problem (the expression was written specifically to hint to the JIT that the second subtraction operand should not be loaded separately but folded into a memory operand):
```csharp
StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));
```

```asm
;StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));
00007ffd`19ab4180 4983c310   add     r11, 10h
00007ffd`19ab4184 c4c1792813 vmovapd xmm2, xmmword ptr [r11]
00007ffd`19ab4189 c4e1715cd2 vsubpd  xmm2, xmm1, xmm2
00007ffd`19ab418e 4983c210   add     r10, 10h
00007ffd`19ab4192 c4c17b1112 vmovsd  qword ptr [r10], xmm2
```

This code has two problems: (i) inefficient memory address calculation, and (ii) the memory operand is not folded into one of the `vsubpd` operands. There are some possible optimizations (a self-contained version of the example appears after the list below).
- Fold the load into the last `vsubpd` operand as a memory operand:
```asm
;StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));
00007ffd`19ab4180 4983c310   add     r11, 10h
00007ffd`19ab4189 c4e1715cd2 vsubpd  xmm2, xmm1, xmmword ptr [r11]
00007ffd`19ab418e 4983c210   add     r10, 10h
00007ffd`19ab4192 c4c17b1112 vmovsd  qword ptr [r10], xmm2
```

- Improve the addressing of operands; this particular problem is tracked by #10915 (Codegen for `LoadVector128` for a field of a struct is "poor"):
```asm
;StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));
00007ffd`19ab4189 c4e1715cd2 vsubpd  xmm2, xmm1, xmmword ptr [r11 + 10h]
00007ffd`19ab4192 c4c17b1112 vmovsd  qword ptr [r10 + 10h], xmm2
```

By applying these optimizations, the above code should be roughly 2.5x faster.
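For context, below is a minimal self-contained sketch of the statement in the example above. The `Item` layout and the method shape are assumptions made for illustration only, since the issue shows just the single statement:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using static System.Runtime.Intrinsics.X86.Sse2;

unsafe struct Item { public double A, B, C, D; } // assumed layout: four doubles

static unsafe class Repro
{
    static void Kernel(Item* items, double* rf, Vector128<double> iVec, long j)
    {
        // The LoadAlignedVector128 result feeds straight into Subtract, making it
        // a natural candidate for folding into a vsubpd memory operand.
        StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));
    }
}
```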
There are several possible solutions for handling memory operands.
The simplest is to give control to developers by providing overloads that accept memory pointers in addition to `Vector128<T>` or `Vector256<T>`:
```csharp
Vector128<double> Sse2.Add(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Add(Vector128<double> left, double* right);
Vector128<double> Sse2.Divide(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Divide(Vector128<double> left, double* right);
Vector128<double> Sse2.Multiply(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Multiply(Vector128<double> left, double* right);
Vector128<double> Sse2.Subtract(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Subtract(Vector128<double> left, double* right);
```
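As a usage illustration, here is a sketch contrasting today's code with the proposed pointer overload (the `double*` overload is hypothetical; it exists only as the proposal above):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class Usage
{
    static Vector128<double> Sub(Vector128<double> iVec, double* src)
    {
        // Today: an explicit load, which the JIT currently emits as a separate vmovapd.
        Vector128<double> current = Sse2.Subtract(iVec, Sse2.LoadAlignedVector128(src));

        // Proposed (hypothetical overload): the pointer argument maps directly
        // to a vsubpd memory operand, with no separate load instruction.
        Vector128<double> proposed = Sse2.Subtract(iVec, src);

        return proposed;
    }
}
```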
With `Vector128<T>` marked as a blittable type, it should be possible to have even better self-documenting overloads (provided C# supported pointers to generic blittable types):

```csharp
Vector128<double> Sse2.Add(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Add(Vector128<double> left, Vector128<double>* right);
```

A more complex approach, and a very inefficient one from the developer's perspective, is to provide JIT support for folding loads and `Unsafe.Read<T>` reads into memory operands. Unfortunately, the burden of writing more code would make intrinsics even harder to use, and some developers would not even know this support existed without digging into the docs.
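For concreteness, this is roughly the pattern such JIT support would have to recognize. `Unsafe.Read<T>` is the existing `System.Runtime.CompilerServices.Unsafe` API; the folding into the `vsubpd` memory operand is the hypothetical part:

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class Folding
{
    static Vector128<double> SubtractFromMemory(Vector128<double> left, double* addr)
    {
        // Today this compiles to a separate vector load followed by vsubpd;
        // with folding support, the read would become vsubpd's memory operand.
        return Sse2.Subtract(left, Unsafe.Read<Vector128<double>>(addr));
    }
}
```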
IMHO the best solution would be to expand the API surface, since that would be a self-documenting enhancement. Furthermore, in my experience, managing the data flow through memory to avoid the memory wall is one of the most difficult parts of coding with HW intrinsics.