JIT: inefficient codegen for calls returning 16-byte structs on Linux x64 / arm64 #8571

@AndyAyersMS

From the binarytrees performance benchmark: codegen for the initial call to bottomUpTree from Bench, followed by the call to itemCheck on the returned struct (other calls to bottomUpTree have similar issues).

;;; Windows (return via hidden byref)
       488D4C2428           lea      rcx, bword ptr [rsp+28H]    ;; address of return byref
       448BC3               mov      r8d, ebx
       33D2                 xor      edx, edx
       E863FBFFFF           call     TreeNode:bottomUpTree(int,int):struct
       488D4C2428           lea      rcx, bword ptr [rsp+28H]
       E869FBFFFF           call     TreeNode:itemCheck():int:this

;;; Linux (return in register pair)
       418BF7               mov      esi, r15d
       33FF                 xor      edi, edi
       E85BFBFFFF           call     TreeNode:bottomUpTree(int,int):struct
       48894598             mov      gword ptr [rbp-68H], rax       ;; spill return to temp
       488955A0             mov      qword ptr [rbp-60H], rdx
       488D7D98             lea      rdi, bword ptr [rbp-68H]       ;; copy temp to another temp
       488B07               mov      rax, gword ptr [rdi]
       488945B0             mov      gword ptr [rbp-50H], rax
       8B7F08               mov      edi, dword ptr [rdi+8]
       897DB8               mov      dword ptr [rbp-48H], edi
       488D7DB0             lea      rdi, bword ptr [rbp-50H]       ;; pass 2nd temp to itemCheck
       E849FBFFFF           call     TreeNode:itemCheck():int:this
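
For readers who want to reproduce the two return conventions outside the JIT, here is a minimal C sketch of the same call shape. The struct layout (one pointer plus one int, padded to 16 bytes on 64-bit targets) is an assumption inferred from the 8-byte and 4-byte stores in the listing above, and the names `bottom_up_tree`, `item_check`, and `free_tree` are hypothetical stand-ins, not the benchmark's source. Under the System V x64 ABI a C compiler returns such a struct in RAX:RDX, while the Windows x64 ABI returns it through a hidden buffer whose address is passed in RCX.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the benchmark's TreeNode:
   one pointer plus one int, 16 bytes with padding on LP64 targets. */
typedef struct Next Next;

typedef struct TreeNode {
    Next *next;   /* NULL for a leaf */
    int   item;
} TreeNode;

struct Next {
    TreeNode left;
    TreeNode right;
};

/* Returned by value: register pair on SysV x64, hidden return buffer on Windows x64. */
TreeNode bottom_up_tree(int item, int depth) {
    TreeNode node = { NULL, item };
    if (depth > 0) {
        node.next = malloc(sizeof(Next));
        if (node.next == NULL) abort();
        node.next->left  = bottom_up_tree(2 * item - 1, depth - 1);
        node.next->right = bottom_up_tree(2 * item, depth - 1);
    }
    return node;
}

/* Takes the struct by address, like the byref passed to itemCheck. */
int item_check(const TreeNode *node) {
    if (node->next == NULL) return node->item;
    return node->item
         + item_check(&node->next->left)
         - item_check(&node->next->right);
}

/* Releases the heap-allocated children of a tree built above. */
void free_tree(TreeNode *node) {
    if (node->next == NULL) return;
    free_tree(&node->next->left);
    free_tree(&node->next->right);
    free(node->next);
    node->next = NULL;
}
```

Calling `item_check(&root)` on the result of `bottom_up_tree(0, 4)` mirrors the call pair in the listings. With an optimizing C compiler on Linux x64, one would expect the returned pair to be stored to the caller's local once before its address is taken, rather than copied through a second temp.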

bottomUpTree has similar issues at its recursive call sites, and also does some redundant zeroing of temp structs that were zeroed in the prolog:

;; prolog: zero from rbp-28H to rbp-88H
       488DBD78FFFFFF       lea      rdi, [rbp-88H]
       B918000000           mov      ecx, 24
       33C0                 xor      rax, rax
       F3AB                 rep stosd 

;; later: re-zero part of the range
       488D7DB8             lea      rdi, bword ptr [rbp-48H]

G_M53682_IG03:
       660F57C0             xorpd    xmm0, xmm0
       F30F7F07             movdqu   qword ptr [rdi], xmm0

;; later: re-zero another part, overwrite it (partially with a zero),
;; then immediately read & return the values just written as a pair
       488D45C8             lea      rax, bword ptr [rbp-38H]

G_M53682_IG09:
       660F57C0             xorpd    xmm0, xmm0
       F30F7F00             movdqu   qword ptr [rax], xmm0

G_M53682_IG10:
       895DD0               mov      dword ptr [rbp-30H], ebx
       33C0                 xor      rax, rax
       488945C8             mov      gword ptr [rbp-38H], rax
       488B45C8             mov      rax, gword ptr [rbp-38H]
       488B55D0             mov      rdx, qword ptr [rbp-30H]

G_M53682_IG11:
       488D65D8             lea      rsp, [rbp-28H]
       5B                   pop      rbx
       415C                 pop      r12
       415D                 pop      r13
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret      

Note that this latter bit of code (the stores and reloads in IG10) could simply be something like

   movsx   rdx, ebx
   lea     rsp, ...
   ...
   ret
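
The equivalence behind that suggestion can be sketched in C. This is an illustration under assumptions, not the JIT's logic: the struct layout is inferred from the listing, and `make_leaf_via_memory` and `make_leaf_direct` are hypothetical names. The first function mirrors the emitted sequence (zero a temp, store both fields, reload them for the return); the second produces the same value directly, which is all the return path needs. Note that the emitted code also leaves a zero in rax for the first field, so a direct version would presumably still need that register cleared.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: pointer-sized reference followed by a 32-bit int. */
typedef struct TreeNode {
    void *next;
    int   item;
} TreeNode;

/* Mirrors the emitted sequence: re-zero the temp, store the fields, reload. */
TreeNode make_leaf_via_memory(int item) {
    TreeNode tmp;
    memset(&tmp, 0, sizeof tmp);  /* redundant zeroing of the temp */
    tmp.item = item;              /* mov dword ptr [rbp-30H], ebx */
    tmp.next = NULL;              /* xor rax, rax ; mov [rbp-38H], rax */
    return tmp;                   /* fields reloaded into the return pair */
}

/* The same result with no temp: a null reference and the item. */
TreeNode make_leaf_direct(int item) {
    TreeNode node = { NULL, item };
    return node;
}
```

Both functions return the same field values for any `item`, so every store to the temp in the first version is dead once the values are available in registers.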

category:cq
theme:structs
skill-level:expert
cost:large

Labels

arch-x64
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
enhancement (Product code improvement that does NOT require public API changes/additions)
optimization
tenet-performance (Performance related issue)
