Codegen for Hardware Intrinsics arithmetic operations memory operands is poor #10923

@4creators

Description

The majority of hardware intrinsic arithmetic operations accept a memory address as one of their operands. This allows writing more efficient code that avoids memory bottlenecks. Unfortunately, the JIT does not fold memory loads into an operand of the arithmetic operation; it generates separate load or store instructions instead.

The following example illustrates the problem (the expression was written specifically to hint to the JIT that the second subtraction operand should be folded into a memory operand rather than loaded separately):

StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2))); 
;StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));                  
00007ffd`19ab4180 4983c310        add     r11, 10h  
00007ffd`19ab4184 c4c1792813      vmovapd xmm2, xmmword ptr [r11]   
00007ffd`19ab4189 c4e1715cd2      vsubpd  xmm2, xmm1, xmm2  
00007ffd`19ab418e 4983c210        add     r10,10h   
00007ffd`19ab4192 c4c17b1112      vmovsd  qword ptr [r10], xmm2

This code has two problems: (i) inefficient memory address calculation, and (ii) the memory load is not folded into one of the vsubpd operands. Two optimizations are possible.

  1. Fold the load into the last vsubpd operand as a memory operand.
;StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));
00007ffd`19ab4180 4983c310        add     r11, 10h    
00007ffd`19ab4189 c4e1715cd2      vsubpd  xmm2, xmm1, xmmword ptr [r11]  
00007ffd`19ab418e 4983c210        add     r10,10h   
00007ffd`19ab4192 c4c17b1112      vmovsd  qword ptr [r10], xmm2
  2. Improve addressing of operands. This particular problem is tracked by Codegen for LoadVector128 for a field of a struct is "poor" #10915.
;StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));                    
00007ffd`19ab4189 c4e1715cd2      vsubpd  xmm2, xmm1, xmmword ptr [r11 + 10h]  
00007ffd`19ab4192 c4c17b1112      vmovsd  qword ptr [r10 + 10h], xmm2

With both optimizations applied, the code above should be roughly 2.5x faster.
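For anyone who wants to inspect the generated code themselves, here is a minimal sketch of the pattern discussed above. It assumes .NET 6 or later (for `NativeMemory.AlignedAlloc`) and an x64 machine with SSE2. The class name `Repro`, the method name `SubtractAndStore`, and the sample values are made up for illustration; only the `Sse2` calls come from the expression in the issue. Whether the load is folded depends on the JIT version in use, so the disassembly has to be checked on the runtime in question (on recent .NET versions, setting the `DOTNET_JitDisasm` environment variable to the method name is one way to do that).

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static unsafe class Repro
{
    // Hypothetical stand-in for the loop body from the issue.
    // NoInlining keeps the method separate so its disassembly is easy to find.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void SubtractAndStore(double* rf, double* items, Vector128<double> iVec)
    {
        // Same shape as the issue: the aligned load is written inline as the
        // second operand of Subtract, and only the lower element is stored.
        Sse2.StoreScalar(rf + 2, Sse2.Subtract(iVec, Sse2.LoadAlignedVector128(items + 2)));
    }

    public static void Main()
    {
        if (!Sse2.IsSupported)
        {
            Console.WriteLine("SSE2 is not supported on this machine.");
            return;
        }

        // 16-byte aligned buffers, so that (items + 2) is also 16-byte aligned.
        double* items = (double*)NativeMemory.AlignedAlloc((nuint)(4 * sizeof(double)), (nuint)16);
        double* rf = (double*)NativeMemory.AlignedAlloc((nuint)(4 * sizeof(double)), (nuint)16);
        try
        {
            for (int i = 0; i < 4; i++)
            {
                items[i] = i;
                rf[i] = 0.0;
            }

            Vector128<double> iVec = Vector128.Create(10.0, 20.0);
            SubtractAndStore(rf, items, iVec);

            // Lower element of (10, 20) - (2, 3) is 8.
            Console.WriteLine(rf[2]);
        }
        finally
        {
            NativeMemory.AlignedFree(items);
            NativeMemory.AlignedFree(rf);
        }
    }
}
```

The project needs `AllowUnsafeBlocks` enabled, since the sketch uses raw pointers just like the original expression.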

There are several possible solutions for memory operand handling.

The simplest one is to give control to developers and provide overloads that accept memory pointers in addition to Vector128<T> or Vector256<T>.

Vector128<double> Sse2.Add(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Add(Vector128<double> left, double* right);

Vector128<double> Sse2.Divide(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Divide(Vector128<double> left, double* right);

Vector128<double> Sse2.Multiply(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Multiply(Vector128<double> left, double* right);

Vector128<double> Sse2.Subtract(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Subtract(Vector128<double> left, double* right);

If Vector128<T> is marked as a blittable type, even better self-documenting overloads should be possible (provided C# supports pointers to generic blittable types).

Vector128<double> Sse2.Add(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Add(Vector128<double> left, Vector128<double>* right);
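To make the proposed call shape concrete, the overloads above can be approximated today with user-defined helpers. This is only a sketch: `Sse2Mem` is a hypothetical class, not part of System.Runtime.Intrinsics, and the helpers simply forward to an explicit load, so whether the JIT folds that load into the arithmetic instruction is exactly the open question of this issue. It assumes a C# version that allows pointers to constructed unmanaged types such as `Vector128<double>*` (C# 8 or later).

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper type; it is NOT part of System.Runtime.Intrinsics.
public static unsafe class Sse2Mem
{
    // Mirrors the proposed Subtract(Vector128<double>, double*) shape.
    public static Vector128<double> Subtract(Vector128<double> left, double* right)
        => Sse2.Subtract(left, Sse2.LoadVector128(right));

    // Mirrors the proposed Vector128<double>* shape.
    public static Vector128<double> Subtract(Vector128<double> left, Vector128<double>* right)
        => Sse2.Subtract(left, Sse2.LoadVector128((double*)right));
}

public static unsafe class Program
{
    public static void Main()
    {
        if (!Sse2.IsSupported)
        {
            return;
        }

        // Stack memory is not guaranteed to be 16-byte aligned, which is why
        // the helpers use the unaligned LoadVector128.
        double* data = stackalloc double[] { 2.0, 3.0 };
        Vector128<double> left = Vector128.Create(10.0, 20.0);

        Vector128<double> a = Sse2Mem.Subtract(left, data);
        Vector128<double> b = Sse2Mem.Subtract(left, (Vector128<double>*)data);

        Console.WriteLine(a.ToScalar()); // lower element: 10 - 2 = 8
        Console.WriteLine(a.Equals(b));  // both call shapes give the same vector
    }
}
```

At the call site the pointer argument documents that the operand comes from memory, which is the self-documenting property the proposal is after.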

A more complex option, and a very inefficient one from the developer's perspective, is to provide JIT support for folding loads and Unsafe.Read<T> reads into memory operands. Unfortunately, the burden of writing more code would make intrinsics even harder to use, and some developers would not know how to use that support without digging into the docs.
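For comparison, this is roughly what the Unsafe.Read<T> form looks like at the call site. It is a sketch under assumptions: the class and method names are hypothetical, .NET 6 or later is assumed for `NativeMemory.AlignedAlloc`, and nothing here guarantees that the read is folded into the subtraction; that would be up to the JIT.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical illustration type, not an existing API.
public static unsafe class UnsafeReadSketch
{
    // The memory operand is expressed as a typed read instead of a Load* intrinsic.
    public static Vector128<double> SubtractFromMemory(Vector128<double> left, double* right)
        => Sse2.Subtract(left, Unsafe.Read<Vector128<double>>(right));

    public static void Main()
    {
        if (!Sse2.IsSupported)
        {
            return;
        }

        // Aligned allocation, so the 16-byte read is naturally aligned.
        double* data = (double*)NativeMemory.AlignedAlloc((nuint)(2 * sizeof(double)), (nuint)16);
        try
        {
            data[0] = 2.0;
            data[1] = 3.0;

            Vector128<double> result = SubtractFromMemory(Vector128.Create(10.0, 20.0), data);
            Console.WriteLine(result.ToScalar()); // lower element: 10 - 2 = 8
        }
        finally
        {
            NativeMemory.AlignedFree(data);
        }
    }
}
```

Compared with a pointer overload, the developer has to know that this particular read pattern is the one the JIT recognizes, which is the discoverability concern raised above.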

IMHO the best solution would be to expand the API surface, as this would be a self-documenting enhancement. Furthermore, in my experience, managing data flow through memory while avoiding the memory wall is one of the most difficult parts of coding with HW intrinsics.

cc @AndyAyersMS @CarolEidt @eerhardt @fiigii @tannergooding

Metadata

Assignees

Labels

area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
enhancement (Product code improvement that does NOT require public API changes/additions)
optimization
