stackalloc long[3] is slower than [0,0,0] #121248

@EgorBo

Noticed while working on #121225

Here is the minimal repro:

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Engines;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);

public class Benchmarks
{
    [Benchmark]
    public long Bench_stackalloc() => ParseNonCanonical_stackalloc("11");

    [Benchmark]
    public long Bench_InlineArray() => ParseNonCanonical_InlineArray("11");


    [MethodImpl(MethodImplOptions.NoInlining)]
    int ParseNonCanonical_stackalloc(ReadOnlySpan<char> name)
    {
        Span<long> parts = stackalloc long[3];
        Consume(parts);
        return name[1];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    int ParseNonCanonical_InlineArray(ReadOnlySpan<char> name)
    {
        Span<long> parts = [0, 0, 0];
        Consume(parts);
        return name[1];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(Span<long> parts) { }
}

Benchmark results on Linux-x64:

| Method            | Mean     | Error     | StdDev    |
|------------------ |---------:|----------:|----------:|
| Bench_stackalloc  | 6.967 ns | 0.1560 ns | 0.2135 ns |
| Bench_InlineArray | 1.608 ns | 0.0043 ns | 0.0034 ns |

Presumably, the perf penalty comes from a store-forwarding stall on this 16-byte stack-to-stack copy:

       vmovdqu  xmm0, xmmword ptr [rsp+0x30]
       vmovdqu  xmmword ptr [rsp+0x20], xmm0

I haven't looked into JitDump yet to tell why.
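
For context, here is a minimal sketch showing that the two initializations in the repro should produce the same data, so the gap in the table would come from codegen rather than from different semantics. The `SpanInitSketch` class and its method names are hypothetical and exist only for this illustration. It assumes C# 12 or later on .NET 8 or later (for the collection expression) and that `[SkipLocalsInit]` is not applied, since the zeroing of `stackalloc` memory relies on the compiler's default `localsinit` behavior rather than on a language guarantee.

```csharp
using System;

// Print both buffers and whether they hold the same elements.
long[] a = SpanInitSketch.ViaStackalloc();
long[] b = SpanInitSketch.ViaCollectionExpression();
Console.WriteLine($"stackalloc:            [{string.Join(", ", a)}]");
Console.WriteLine($"collection expression: [{string.Join(", ", b)}]");
Console.WriteLine($"same contents: {a.AsSpan().SequenceEqual(b)}");

// Hypothetical helper type, not part of the original repro.
public static class SpanInitSketch
{
    // Same initialization as ParseNonCanonical_stackalloc in the repro.
    // The contents are zeroed only because locals are zero-initialized by
    // default (assumption: no [SkipLocalsInit] on this method or module).
    public static long[] ViaStackalloc()
    {
        Span<long> parts = stackalloc long[3];
        return parts.ToArray(); // copy out so the data can outlive the stack frame
    }

    // Same initialization as ParseNonCanonical_InlineArray in the repro.
    // Every element is written explicitly by the collection expression.
    public static long[] ViaCollectionExpression()
    {
        Span<long> parts = [0, 0, 0];
        return parts.ToArray();
    }
}
```

Both helpers copy the span into an array only so the result can be returned and compared; the benchmark in the repro passes the span directly to `Consume` instead.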

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
