
VCRuntime: memmove() is surprisingly slow for more than 8 KB on certain CPUs #5506

@StephanTLavavej


Extracted from #5502 by @AlexGuteniev:

🐛 memmove performance bug

After the initial implementation, I observed that some of the benchmarks exhibited an unexpected slowdown. The cause turned out to be a surprisingly slow memmove. I've created a benchmark that reproduces the problem.

Benchmark

#include <benchmark/benchmark.h>
#include <cstddef>
#include <cstring>

using namespace std;

alignas(4096) unsigned char v[1024 * 1024];

void bm_memmove(benchmark::State& state) {
    const auto size = static_cast<size_t>(state.range(0));
    const auto n    = static_cast<ptrdiff_t>(state.range(1));

    const size_t n1 = n < 0 ? 0 : n;
    const size_t n0 = n < 0 ? -n : 0;

    benchmark::DoNotOptimize(v);

    for (auto _ : state) {
        memmove(v + n0, v + n1, size);
        benchmark::DoNotOptimize(v);
    }
}

BENCHMARK(bm_memmove)->ArgsProduct({{8191, 8193}, {-5, +5}});

BENCHMARK_MAIN();
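
The sign convention of the second benchmark argument is easy to misread, so here is a minimal sketch of the offset arithmetic pulled out of `bm_memmove`. The names `Offsets` and `offsets_for` are hypothetical and not part of the original repro; the logic only mirrors the `n0`/`n1` computation above.

```cpp
#include <cstddef>

// Hypothetical helper mirroring the n0/n1 computation in bm_memmove.
struct Offsets {
    std::size_t dst; // n0: offset of the memmove destination within v
    std::size_t src; // n1: offset of the memmove source within v
};

constexpr Offsets offsets_for(std::ptrdiff_t n) {
    return Offsets{
        n < 0 ? static_cast<std::size_t>(-n) : 0, // n0
        n < 0 ? 0 : static_cast<std::size_t>(n),  // n1
    };
}
```

With this mapping, the `bm_memmove/8193/5` row corresponds to `memmove(v, v + 5, 8193)`: the source starts 5 bytes above the destination and the two regions overlap. The `-5` rows are the mirror case, with the destination 5 bytes above the source.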

Results on i5-1235U (Alder Lake)

-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
bm_memmove/8191/-5       71.4 ns         71.5 ns      8960000
bm_memmove/8193/-5       71.1 ns         71.5 ns      8960000
bm_memmove/8191/5        62.6 ns         61.0 ns      8960000
bm_memmove/8193/5        1903 ns         1925 ns       373333

Results on i7-8750H (Coffee Lake)

-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
bm_memmove/8191/-5        143 ns          141 ns      4977778
bm_memmove/8193/-5        145 ns          146 ns      4480000
bm_memmove/8191/5        77.2 ns         76.7 ns      8960000
bm_memmove/8193/5        80.9 ns         80.2 ns      8960000

Analysis

All I know or suspect so far:

  • The problem exists on Alder Lake (Intel Core 12th gen) but does not exist on Coffee Lake (Intel Core 8th gen) or Skylake (Intel Core 6th gen)
  • The problematic instruction is rep movsb, which is used in memmove
  • I can reproduce the problematic behavior when the size is greater than 8192 bytes and the pointer difference is smaller than 64 bytes
  • Clang on Linux is also affected, as shown by this reproduction: https://quick-bench.com/q/HgY3kPAaUIqkfmzwz_NFeoTcj3U
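
To probe the two suspected boundaries (8192 bytes for the size, 64 bytes for the pointer difference), something like the following sweep might help. It is a rough sketch, not the original repro: `probe_buf`, `time_memmove_ns`, `Sample`, and `sweep` are hypothetical names, the sizes and distances are simply chosen to straddle the thresholds mentioned above, and a plain `std::chrono` loop is far less rigorous than Google Benchmark. It only tests the direction that was slow in the results (source above destination).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Page-aligned buffer, like v in the original benchmark (hypothetical name).
alignas(4096) static unsigned char probe_buf[1024 * 1024];

// Average nanoseconds per memmove(probe_buf + dst_off, probe_buf + src_off, size).
double time_memmove_ns(std::size_t dst_off, std::size_t src_off,
                       std::size_t size, int reps = 2000) {
    assert(dst_off + size <= sizeof(probe_buf));
    assert(src_off + size <= sizeof(probe_buf));
    volatile unsigned char sink = 0;

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        std::memmove(probe_buf + dst_off, probe_buf + src_off, size);
        sink = probe_buf[dst_off]; // read back to discourage eliding the copy
    }
    const auto stop = std::chrono::steady_clock::now();
    (void) sink;

    return std::chrono::duration<double, std::nano>(stop - start).count() / reps;
}

struct Sample {
    std::size_t size;     // bytes copied
    std::size_t distance; // src - dst, in bytes
    double ns;            // average time per call
};

// Sizes and distances straddle the thresholds suspected in the analysis.
std::vector<Sample> sweep() {
    std::vector<Sample> out;
    for (std::size_t size : {8191u, 8192u, 8193u, 16384u}) {
        for (std::size_t distance : {1u, 5u, 63u, 64u, 65u}) {
            out.push_back(Sample{size, distance, time_memmove_ns(0, distance, size)});
        }
    }
    return out;
}
```

Printing each `Sample` returned by `sweep()` on an affected and an unaffected CPU would show whether the slowdown really switches on just past 8192 bytes and off at a distance of 64 bytes, or whether the boundaries lie elsewhere.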

I appreciate any help in investigating the issue further.

Ideally, we should report this issue to the VCRuntime maintainers.
But I feel we should first gather more information so that we can report it better.

Labels: external (This issue is unrelated to the STL), resolved (Successfully resolved without a commit)
