[RyuJit/x64] Generates extra movsxd when using a byte value as an address offset #8465

@saucecontrol


In image processing, it's common to map byte values (8-bit-per-channel pixels) through a lookup table.

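RyuJit x64 apparently reads each byte with a 32-bit movzx, then widens the result with movsxd before using it as an address offset. The movsxd looks redundant: the value is already zero-extended and can never be negative, and writing a 32-bit register on x64 clears the upper 32 bits anyway.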

Here's a quick sample program demonstrating the issue:

using System;
using System.Runtime.CompilerServices;

class Program
{
	[MethodImpl(MethodImplOptions.NoInlining)]
	unsafe static void LUTMap(byte* src, float* dest, float* lut, int cb)
	{
		byte* end = src + cb - 4;
		while(src <= end)
		{
			dest[0] = lut[src[0]];
			dest[1] = lut[src[1]];
			dest[2] = lut[src[2]];
			dest[3] = lut[src[3]];

			src  += 4;
			dest += 4;
		}
	}

	unsafe static void Main(string[] args)
	{
		var bytes = new byte[1024 * 1024];
		var floats = new float[bytes.Length];
		var lut = new float[256];


		new Random(42).NextBytes(bytes);
		for (int i = 0; i < lut.Length; i++)
			lut[i] = (float)Math.Pow(i, 1 / 2.4);

		fixed (byte* pbytes = &bytes[0])
		fixed (float* pfloats = &floats[0], plut = &lut[0])
		{
			LUTMap(pbytes, pfloats, plut, bytes.Length);
		}
	}
}

RyuJit x64 generates the following for the LUTMap method:

; Assembly listing for method Program:LUTMap(long,long,long,int)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] ( 11, 32   )    long  ->  rcx
;  V01 arg1         [V01,T01] (  8, 26   )    long  ->  rdx
;  V02 arg2         [V02,T02] (  6, 18   )    long  ->   r8
;  V03 arg3         [V03,T04] (  3,  3   )     int  ->   r9
;  V04 loc0         [V04,T03] (  3,  6   )    long  ->  rax
;# V05 OutArgs      [V05    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0

G_M15454_IG01:
       C5F877               vzeroupper

G_M15454_IG02:
       4963C1               movsxd   rax, r9d
       488D4401FC           lea      rax, [rcx+rax-4]
       483BC8               cmp      rcx, rax
       7760                 ja       SHORT G_M15454_IG04

G_M15454_IG03:
       440FB609             movzx    r9, byte  ptr [rcx]
       4D63C9               movsxd   r9, r9d
       C4817A100488         vmovss   xmm0, dword ptr [r8+4*r9]
       C4E17A1102           vmovss   dword ptr [rdx], xmm0
       440FB64901           movzx    r9, byte  ptr [rcx+1]
       4D63C9               movsxd   r9, r9d
       C4817A100488         vmovss   xmm0, dword ptr [r8+4*r9]
       C4E17A114204         vmovss   dword ptr [rdx+4], xmm0
       440FB64902           movzx    r9, byte  ptr [rcx+2]
       4D63C9               movsxd   r9, r9d
       C4817A100488         vmovss   xmm0, dword ptr [r8+4*r9]
       C4E17A114208         vmovss   dword ptr [rdx+8], xmm0
       440FB64903           movzx    r9, byte  ptr [rcx+3]
       4D63C9               movsxd   r9, r9d
       C4817A100488         vmovss   xmm0, dword ptr [r8+4*r9]
       C4E17A11420C         vmovss   dword ptr [rdx+12], xmm0
       4883C104             add      rcx, 4
       4883C210             add      rdx, 16
       483BC8               cmp      rcx, rax
       76A5                 jbe      SHORT G_M15454_IG03

G_M15454_IG04:
       C3                   ret

; Total bytes of code 108, prolog size 3 for method Program:LUTMap(long,long,long,int)
; ============================================================

Can it be modified to use the 64-bit movzx in cases like this?
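
To illustrate why the extra movsxd should be safe to drop, here is a minimal sketch that checks, for every possible byte value, that widening in two steps (byte to int, then int to long) gives the same result as a single zero-extension to 64 bits. The `WideningCheck` class and its method names are hypothetical, made up for this illustration; they only model the arithmetic and say nothing about what the JIT will actually emit.

```csharp
using System;

static class WideningCheck
{
    // Hypothetical helper: models the two-step widening seen in the listing
    // (zero-extend the byte to a 32-bit int, then sign-extend that to 64 bits).
    public static long TwoStep(byte b)
    {
        int i32 = b;       // byte -> int zero-extends, so the result is 0..255
        return (long)i32;  // int -> long sign-extends, but the sign bit is always clear
    }

    // Hypothetical helper: models a single zero-extension from byte to 64 bits.
    public static long OneStep(byte b)
    {
        return (long)(ulong)b;
    }

    // Returns true if both widenings agree for all 256 byte values.
    public static bool AllEqual()
    {
        for (int v = 0; v <= byte.MaxValue; v++)
        {
            if (TwoStep((byte)v) != OneStep((byte)v))
                return false;
        }
        return true;
    }
}
```

Since the two paths agree across the whole byte range, the sign extension adds nothing for an index loaded from a byte.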

category:cq
theme:basic-cq
skill-level:intermediate
cost:medium
impact:small

Labels: area-CodeGen-coreclr, enhancement, optimization, tenet-performance
