Optimize Vector4.Lerp #35525

Merged
tannergooding merged 3 commits into dotnet:master from EgorBo:vector4-lerp
May 4, 2020

EgorBo commented Apr 27, 2020

Before:

G_M10056_IG01:
       sub      rsp, 56
       vzeroupper 
       vmovaps  qword ptr [rsp+20H], xmm6
       vmovaps  qword ptr [rsp+10H], xmm7
       vmovaps  qword ptr [rsp], xmm8
G_M10056_IG02:
       vmovss   xmm0, dword ptr [rdx]
       vmovss   xmm1, dword ptr [rdx+4]
       vmovss   xmm2, dword ptr [rdx+8]
       vmovss   xmm4, dword ptr [rdx+12]
       vmovss   xmm5, dword ptr [r8]
       vmovss   xmm6, dword ptr [r8+4]
       vmovss   xmm7, dword ptr [r8+8]
       vmovss   xmm8, dword ptr [r8+12]
       vsubss   xmm5, xmm5, xmm0
       vmulss   xmm5, xmm5, xmm3
       vaddss   xmm0, xmm5, xmm0
       vsubss   xmm5, xmm6, xmm1
       vmulss   xmm5, xmm5, xmm3
       vaddss   xmm1, xmm5, xmm1
       vsubss   xmm5, xmm7, xmm2
       vmulss   xmm5, xmm5, xmm3
       vaddss   xmm2, xmm5, xmm2
       vsubss   xmm5, xmm8, xmm4
       vmulss   xmm3, xmm5, xmm3
       vaddss   xmm3, xmm3, xmm4
       vxorps   xmm4, xmm4
       vmovss   xmm4, xmm4, xmm3
       vpslldq  xmm4, 4
       vmovss   xmm4, xmm4, xmm2
       vpslldq  xmm4, 4
       vmovss   xmm4, xmm4, xmm1
       vpslldq  xmm4, 4
       vmovss   xmm4, xmm4, xmm0
       vmovaps  xmm0, xmm4
       vmovupd  xmmword ptr [rcx], xmm0
       mov      rax, rcx
G_M10056_IG03:
       vmovaps  xmm6, qword ptr [rsp+20H]
       vmovaps  xmm7, qword ptr [rsp+10H]
       vmovaps  xmm8, qword ptr [rsp]
       add      rsp, 56
       ret      
; Total bytes of code: 182

After:

       vzeroupper 
G_M18874_IG02:
       vmovupd  xmm0, xmmword ptr [r8]
       vmovupd  xmm1, xmmword ptr [rdx]
       vsubps   xmm0, xmm1
       vbroadcastss xmm3, xmm3
       vmulps   xmm0, xmm3
       vaddps   xmm0, xmm1, xmm0
       vmovupd  xmmword ptr [rcx], xmm0
       mov      rax, rcx
G_M18874_IG03:
       ret
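
The source change itself is not shown in this conversation, so the following is only a sketch of the kind of rewrite that would produce the two listings above. It assumes the old body built the result component by component and the new body uses the `Vector4` operators; the class and method names (`LerpSketch`, `LerpScalar`, `LerpVectorized`) are hypothetical.

```csharp
using System.Numerics;

// Hypothetical sketch, not the actual diff from this pull request.
public static class LerpSketch
{
    // Per-component form: four independent subtract/multiply/add chains,
    // which matches the vsubss/vmulss/vaddss pattern in the "Before" listing.
    public static Vector4 LerpScalar(Vector4 value1, Vector4 value2, float amount)
    {
        return new Vector4(
            value1.X + (value2.X - value1.X) * amount,
            value1.Y + (value2.Y - value1.Y) * amount,
            value1.Z + (value2.Z - value1.Z) * amount,
            value1.W + (value2.W - value1.W) * amount);
    }

    // Whole-vector form: the same arithmetic expressed with Vector4 operators,
    // which the JIT can lower to packed vsubps/vmulps/vaddps as in the "After" listing.
    public static Vector4 LerpVectorized(Vector4 value1, Vector4 value2, float amount)
    {
        return value1 + (value2 - value1) * amount;
    }
}
```

Both methods perform the same subtract, multiply, and add per component, so they should return identical values; for example, `LerpSketch.LerpVectorized(new Vector4(1, 2, 3, 4), new Vector4(5, 6, 7, 8), 0.5f)` yields `(3, 4, 5, 6)`.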

ghost commented Apr 27, 2020

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

EgorBo commented Apr 27, 2020

In theory, the following implementation should be faster:

private static Vector4 Lerp(Vector4 value1, Vector4 value2, float amount)
{
    // x86 with FMA: computes amount * value2 + (value1 - amount * value1)
    Vector128<float> amountVec = Vector128.Create(amount);
    return Fma.MultiplyAdd(amountVec, value2.AsVector128(),
        Fma.MultiplyAddNegated(amountVec, value1.AsVector128(), value1.AsVector128())).AsVector4();
}

but only in some sort of fast-math mode, since it does not round the same way as the current formula.
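
One way to see why the FMA variant needs something like fast-math: it is algebraically the same interpolation, but it rounds at different points. A short derivation, writing $a$ = `value1`, $b$ = `value2`, $t$ = `amount` (symbols introduced here for brevity) and using the documented semantics `MultiplyAdd(x, y, z)` = $xy + z$ and `MultiplyAddNegated(x, y, z)` = $-(xy) + z$:

```latex
\begin{align*}
\text{current:}\quad & a + (b - a)\,t \\
\text{FMA variant:}\quad & \operatorname{fma}\bigl(t,\ b,\ \operatorname{fnma}(t,\ a,\ a)\bigr)
  = t\,b + (a - t\,a) \\
  &= a + t\,b - t\,a = a + (b - a)\,t
\end{align*}
```

In exact arithmetic the two agree. In floating point, the current form rounds after the subtract, the multiply, and the add, while each fused operation rounds only once, so the low bits of the result can differ.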

EgorBo commented Apr 27, 2020

Vector2:

Before:

       vzeroupper 
       mov      qword ptr [rsp+08H], rcx
       mov      qword ptr [rsp+10H], rdx
G_M38716_IG02:
       vmovss   xmm0, dword ptr [rsp+10H]
       vmovss   xmm1, dword ptr [rsp+08H]
       vsubss   xmm0, xmm0, xmm1
       vmulss   xmm0, xmm0, xmm2
       vaddss   xmm0, xmm0, xmm1
       vmovss   xmm1, dword ptr [rsp+14H]
       vmovss   xmm3, dword ptr [rsp+0CH]
       vsubss   xmm1, xmm1, xmm3
       vmulss   xmm1, xmm1, xmm2
       vaddss   xmm1, xmm1, xmm3
       vxorps   xmm2, xmm2
       vmovss   xmm2, xmm2, xmm1
       vpslldq  xmm2, 4
       vmovss   xmm2, xmm2, xmm0
       vmovaps  xmm0, xmm2
       vmovd    rax, xmm0
G_M38716_IG03:
       ret      
; Total bytes of code: 88

After:

       push     rax
       vzeroupper 
       vmovd    xmm0, rcx
       vmovd    xmm1, rdx
G_M37838_IG02:
       vsubps   xmm1, xmm0
       vxorps   xmm3, xmm3
       vmovss   xmm3, xmm3, xmm2
       vpslldq  xmm3, 4
       vmovss   xmm3, xmm3, xmm2
       vmovaps  xmm2, xmm3
       vmulps   xmm1, xmm2
       vmovsd   qword ptr [rsp], xmm1
       vmovsd   xmm1, qword ptr [rsp]
       vaddps   xmm0, xmm1
       vmovd    rax, xmm0
G_M37838_IG03:
       add      rsp, 8
       ret      
; Total bytes of code: 67

@tannergooding

Closing and reopening to retrigger the run against current master. It should be good to merge once tests pass.

ghost commented May 4, 2020

Hello @tannergooding!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

EgorBo commented May 4, 2020

@tannergooding the failing job is a known issue: #35812

tannergooding merged commit 2848dbf into dotnet:master May 4, 2020

@tannergooding

Thanks! Merged.

Updating Vector2/3/4 to be consistent is just pending final approval here, at which point we can fix them up: #35529
That will also open things up to use System.Runtime.Intrinsics.Fma when available.
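
As a rough illustration of what "when available" could look like, here is a minimal sketch that guards the FMA path with `Fma.IsSupported` and falls back to the operator-based formula otherwise. The class name `FmaLerpSketch` is hypothetical, and this is not the code that was eventually adopted; the two paths can also differ in the low bits of the result, which a real implementation would have to decide how to handle.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical sketch: FMA path when the hardware supports it, portable path otherwise.
public static class FmaLerpSketch
{
    public static Vector4 Lerp(Vector4 value1, Vector4 value2, float amount)
    {
        if (Fma.IsSupported)
        {
            Vector128<float> t = Vector128.Create(amount); // broadcast amount to all lanes
            Vector128<float> a = value1.AsVector128();
            Vector128<float> b = value2.AsVector128();

            // MultiplyAddNegated(t, a, a) = -(t * a) + a
            Vector128<float> inner = Fma.MultiplyAddNegated(t, a, a);

            // MultiplyAdd(t, b, inner) = (t * b) + inner
            return Fma.MultiplyAdd(t, b, inner).AsVector4();
        }

        // Portable fallback: separate subtract, multiply, and add.
        return value1 + (value2 - value1) * amount;
    }
}
```

The `IsSupported` check is treated as a constant by the JIT, so only one of the two branches should remain in the generated code on a given machine.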

EgorBo deleted the vector4-lerp branch May 25, 2020 11:54
ghost locked as resolved and limited conversation to collaborators Dec 9, 2020