Labels
area-CodeGen-coreclr, enhancement, optimization, tenet-performance
Description
In image processing, it's common to map byte values (8-bit-per-channel pixels) through a lookup table.
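RyuJIT x64 apparently uses a 32-bit movzx to read each byte value, then widens it with movsxd before using it as an address offset. The movsxd looks redundant: a 32-bit register write already zeroes the upper 32 bits of the 64-bit register, and a zero-extended byte is never negative.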
Here's a quick sample program demonstrating the issue:
using System;
using System.Runtime.CompilerServices;

class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    unsafe static void LUTMap(byte* src, float* dest, float* lut, int cb)
    {
        byte* end = src + cb - 4;
        while (src <= end)
        {
            dest[0] = lut[src[0]];
            dest[1] = lut[src[1]];
            dest[2] = lut[src[2]];
            dest[3] = lut[src[3]];
            src += 4;
            dest += 4;
        }
    }

    unsafe static void Main(string[] args)
    {
        var bytes = new byte[1024 * 1024];
        var floats = new float[bytes.Length];
        var lut = new float[256];

        new Random(42).NextBytes(bytes);
        for (int i = 0; i < lut.Length; i++)
            lut[i] = (float)Math.Pow(i, 1 / 2.4);

        fixed (byte* pbytes = &bytes[0])
        fixed (float* pfloats = &floats[0], plut = &lut[0])
        {
            LUTMap(pbytes, pfloats, plut, bytes.Length);
        }
    }
}

RyuJIT x64 generates the following for the LUTMap method:
; Assembly listing for method Program:LUTMap(long,long,long,int)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 11, 32 ) long -> rcx
; V01 arg1 [V01,T01] ( 8, 26 ) long -> rdx
; V02 arg2 [V02,T02] ( 6, 18 ) long -> r8
; V03 arg3 [V03,T04] ( 3, 3 ) int -> r9
; V04 loc0 [V04,T03] ( 3, 6 ) long -> rax
;# V05 OutArgs [V05 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0
G_M15454_IG01:
C5F877 vzeroupper
G_M15454_IG02:
4963C1 movsxd rax, r9d
488D4401FC lea rax, [rcx+rax-4]
483BC8 cmp rcx, rax
7760 ja SHORT G_M15454_IG04
G_M15454_IG03:
440FB609 movzx r9, byte ptr [rcx]
4D63C9 movsxd r9, r9d
C4817A100488 vmovss xmm0, dword ptr [r8+4*r9]
C4E17A1102 vmovss dword ptr [rdx], xmm0
440FB64901 movzx r9, byte ptr [rcx+1]
4D63C9 movsxd r9, r9d
C4817A100488 vmovss xmm0, dword ptr [r8+4*r9]
C4E17A114204 vmovss dword ptr [rdx+4], xmm0
440FB64902 movzx r9, byte ptr [rcx+2]
4D63C9 movsxd r9, r9d
C4817A100488 vmovss xmm0, dword ptr [r8+4*r9]
C4E17A114208 vmovss dword ptr [rdx+8], xmm0
440FB64903 movzx r9, byte ptr [rcx+3]
4D63C9 movsxd r9, r9d
C4817A100488 vmovss xmm0, dword ptr [r8+4*r9]
C4E17A11420C vmovss dword ptr [rdx+12], xmm0
4883C104 add rcx, 4
4883C210 add rdx, 16
483BC8 cmp rcx, rax
76A5 jbe SHORT G_M15454_IG03
G_M15454_IG04:
C3 ret
; Total bytes of code 108, prolog size 3 for method Program:LUTMap(long,long,long,int)
; ============================================================

Can it be modified to use the 64-bit movzx in cases like this?
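To illustrate why the movsxd in the listing above looks unnecessary, here is a minimal sketch that models the two instruction sequences with plain C# integer conversions and checks that they agree for every byte value. The class and method names (`ExtensionCheck`, `ZeroExtendOnly`, `ZeroThenSignExtend`, `AllByteValuesAgree`) are hypothetical and exist only for this illustration; the code models the arithmetic of the instructions and says nothing about what the JIT itself emits.

```csharp
using System;

static class ExtensionCheck
{
    // Models `movzx r32, byte ptr [mem]`: the byte is zero-extended to 32 bits,
    // and on x64 a 32-bit register write clears the upper 32 bits as well.
    public static ulong ZeroExtendOnly(byte b) => (uint)b;

    // Models `movzx r32, byte ptr [mem]` followed by `movsxd r64, r32`:
    // the 32-bit value is sign-extended to 64 bits.
    public static ulong ZeroThenSignExtend(byte b) => unchecked((ulong)(long)(int)b);

    // Returns true if both sequences produce the same 64-bit index
    // for all 256 possible byte values.
    public static bool AllByteValuesAgree()
    {
        for (int i = 0; i <= byte.MaxValue; i++)
        {
            byte b = (byte)i;
            if (ZeroExtendOnly(b) != ZeroThenSignExtend(b))
                return false;
        }
        return true;
    }
}
```

Because a zero-extended byte is at most 255, bit 31 is never set, so sign extension and zero extension give the same 64-bit result and `ExtensionCheck.AllByteValuesAgree()` returns true. Under that assumption the index register could feed the `[r8+4*r9]` addressing mode directly after the movzx.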
category:cq
theme:basic-cq
skill-level:intermediate
cost:medium
impact:small