stackalloc long[3] is slower than [0,0,0] #121248

Noticed while working on https://github.com/dotnet/runtime/pull/121225

Here is the minimal repro:
```cs
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);

public class Benchmarks
{
    [Benchmark]
    public long Bench_stackalloc() => ParseNonCanonical_stackalloc("11");

    [Benchmark]
    public long Bench_InlineArray() => ParseNonCanonical_InlineArray("11");


    [MethodImpl(MethodImplOptions.NoInlining)]
    int ParseNonCanonical_stackalloc(ReadOnlySpan<char> name)
    {
        Span<long> parts = stackalloc long[3];
        Consume(parts);
        return name[1];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    int ParseNonCanonical_InlineArray(ReadOnlySpan<char> name)
    {
        Span<long> parts = [0, 0, 0];
        Consume(parts);
        return name[1];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(Span<long> parts) { }
}
```
Benchmark results on Linux-x64:
```
| Method            | Mean     | Error     | StdDev    |
|------------------ |---------:|----------:|----------:|
| Bench_stackalloc  | 6.967 ns | 0.1560 ns | 0.2135 ns |
| Bench_InlineArray | 1.608 ns | 0.0043 ns | 0.0034 ns |
```

Presumably, the perf penalty comes from a store-forwarding stall on this 16-byte stack-to-stack copy:
```asm
       vmovdqu  xmm0, xmmword ptr [rsp+0x30]
       vmovdqu  xmmword ptr [rsp+0x20], xmm0
```
I haven't looked at the JitDump yet to find out why the JIT emits it.
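For anyone who wants to see the codegen without a checked JIT, one option is BenchmarkDotNet's `[DisassemblyDiagnoser]`. The sketch below is a trimmed variant of the repro; the class name `DisasmBenchmarks` is hypothetical, and `maxDepth: 2` is an assumption meant to make sure the non-inlined callee is included in the listing.

```cs
using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(DisasmBenchmarks).Assembly).Run(args);

// Hypothetical variant of the repro class: only the attribute is new.
// maxDepth: 2 is an assumption, chosen so the NoInlining callee is disassembled too.
[DisassemblyDiagnoser(maxDepth: 2)]
public class DisasmBenchmarks
{
    [Benchmark]
    public long Bench_stackalloc() => ParseNonCanonical_stackalloc("11");

    [MethodImpl(MethodImplOptions.NoInlining)]
    int ParseNonCanonical_stackalloc(ReadOnlySpan<char> name)
    {
        // Same body as the repro: zero-initialized stack buffer passed to an opaque call.
        Span<long> parts = stackalloc long[3];
        Consume(parts);
        return name[1];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(Span<long> parts) { }
}
```

The disassembly is written alongside the other BenchmarkDotNet results, so the `vmovdqu` pair above can be checked against the full method body.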
