Description
The majority of hardware intrinsic arithmetic operations support using a memory address as one of their operands. This allows writing more efficient code that sidesteps memory bottlenecks. Unfortunately, the JIT does not fold memory loads into an arithmetic operation's operand and instead generates code with separate loads or stores.
The following example illustrates the problem (the expression was written specifically to hint to the JIT that the second subtraction operand should not be loaded separately but folded into a memory operand):
```csharp
StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));
```

```asm
;StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));
00007ffd`19ab4180 4983c310   add     r11, 10h
00007ffd`19ab4184 c4c1792813 vmovapd xmm2, xmmword ptr [r11]
00007ffd`19ab4189 c4e1715cd2 vsubpd  xmm2, xmm1, xmm2
00007ffd`19ab418e 4983c210   add     r10, 10h
00007ffd`19ab4192 c4c17b1112 vmovsd  qword ptr [r10], xmm2
```

This code has two problems: (i) inefficient memory address calculation, and (ii) the memory operand is not folded into one of the `vsubpd` operands. There are some possible optimizations (a self-contained version of the example appears after the list below).
- Fold the load into the last `vsubpd` operand as a memory operand:
```asm
;StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));
00007ffd`19ab4180 4983c310   add     r11, 10h
00007ffd`19ab4189 c4e1715cd2 vsubpd  xmm2, xmm1, xmmword ptr [r11]
00007ffd`19ab418e 4983c210   add     r10, 10h
00007ffd`19ab4192 c4c17b1112 vmovsd  qword ptr [r10], xmm2
```

- Improve the addressing of operands; this particular problem is tracked by #10915 (Codegen for `LoadVector128` for a field of a struct is "poor"):
```asm
;StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));
00007ffd`19ab4189 c4e1715cd2 vsubpd  xmm2, xmm1, xmmword ptr [r11 + 10h]
00007ffd`19ab4192 c4c17b1112 vmovsd  qword ptr [r10 + 10h], xmm2
```

By applying these optimizations, the above code should be roughly 2.5x faster.
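For context, below is a minimal self-contained sketch of the statement in the example above. The `Item` layout and the method shape are assumptions made for illustration only, since the issue shows just the single statement:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using static System.Runtime.Intrinsics.X86.Sse2;

unsafe struct Item { public double A, B, C, D; } // assumed layout: four doubles

static unsafe class Repro
{
    static void Kernel(Item* items, double* rf, Vector128<double> iVec, long j)
    {
        // The LoadAlignedVector128 result feeds straight into Subtract, making it
        // a natural candidate for folding into a vsubpd memory operand.
        StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));
    }
}
```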
There are several possible solutions for handling memory operands.
The simplest is to give control to developers by providing overloads that accept memory pointers in addition to `Vector128<T>` or `Vector256<T>`:
```csharp
Vector128<double> Sse2.Add(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Add(Vector128<double> left, double* right);
Vector128<double> Sse2.Divide(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Divide(Vector128<double> left, double* right);
Vector128<double> Sse2.Multiply(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Multiply(Vector128<double> left, double* right);
Vector128<double> Sse2.Subtract(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Subtract(Vector128<double> left, double* right);
```
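As a usage illustration, here is a sketch contrasting today's code with the proposed pointer overload (the `double*` overload is hypothetical; it exists only as the proposal above):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class Usage
{
    static Vector128<double> Sub(Vector128<double> iVec, double* src)
    {
        // Today: an explicit load, which the JIT currently emits as a separate vmovapd.
        Vector128<double> current = Sse2.Subtract(iVec, Sse2.LoadAlignedVector128(src));

        // Proposed (hypothetical overload): the pointer argument maps directly
        // to a vsubpd memory operand, with no separate load instruction.
        Vector128<double> proposed = Sse2.Subtract(iVec, src);

        return proposed;
    }
}
```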
With `Vector128<T>` marked as a blittable type, it should be possible to have even better self-documenting overloads (provided C# supported pointers to generic blittable types):

```csharp
Vector128<double> Sse2.Add(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Add(Vector128<double> left, Vector128<double>* right);
```

A more complex approach, and a very inefficient one from the developer's perspective, is to provide JIT support for folding loads and `Unsafe.Read<T>` reads into memory operands. Unfortunately, the burden of writing more code would make intrinsics even harder to use, and some developers would not even know this support existed without digging into the docs.
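For concreteness, this is roughly the pattern such JIT support would have to recognize. `Unsafe.Read<T>` is the existing `System.Runtime.CompilerServices.Unsafe` API; the folding into the `vsubpd` memory operand is the hypothetical part:

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class Folding
{
    static Vector128<double> SubtractFromMemory(Vector128<double> left, double* addr)
    {
        // Today this compiles to a separate vector load followed by vsubpd;
        // with folding support, the read would become vsubpd's memory operand.
        return Sse2.Subtract(left, Unsafe.Read<Vector128<double>>(addr));
    }
}
```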
IMHO the best solution would be to expand the API surface, since that would be a self-documenting enhancement. Furthermore, in my experience, managing the data flow through memory to avoid the memory wall is one of the most difficult parts of coding with HW intrinsics.