<algorithm>: vectorized algorithms hurt performance. #3601

@atyuwen

Describe the bug

Some STL algorithms automatically take an AVX code path when the CPU supports it, but they do not call _mm256_zeroupper() before returning. On some CPUs (e.g. Skylake) this leaves the upper halves of the YMM registers dirty, which slows down subsequent SSE code.

See also the related question: "Why is this SSE code 6 times slower without VZEROUPPER on Skylake?"
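The pattern being asked for can be sketched as follows. This is a minimal illustration, not the STL's actual implementation: `reverse_copy_8_floats` is a hypothetical helper, and it assumes AVX is enabled at compile time (e.g. `/arch:AVX` or `-mavx`), whereas the STL selects its vectorized path at run time. Without AVX it falls back to plain `std::reverse_copy`.

```cpp
#include <algorithm>
#include <cassert>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper (not part of the STL): copies 8 floats from src to dst
// in reverse order. Assumption: AVX availability is decided at compile time.
inline void reverse_copy_8_floats(const float* src, float* dst) {
    assert(src != nullptr && dst != nullptr);
#if defined(__AVX__)
    __m256 v = _mm256_loadu_ps(src);
    // Reverse the four floats inside each 128-bit lane (selector 0x1B = 3,2,1,0).
    v = _mm256_permute_ps(v, 0x1B);
    // Swap the low and high 128-bit lanes to finish the 8-element reversal.
    v = _mm256_permute2f128_ps(v, v, 0x01);
    _mm256_storeu_ps(dst, v);
    // Zero the upper halves of the YMM registers before returning, so that
    // SSE code executed by the caller does not hit the transition penalty.
    _mm256_zeroupper();
#else
    // Scalar fallback when AVX is not enabled for this translation unit.
    std::reverse_copy(src, src + 8, dst);
#endif
}
```

The key line is the `_mm256_zeroupper()` call at the end of the AVX branch; the shuffle itself is only there to give the function some 256-bit work to do.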

In the following code, the second call to calculate() is much slower than the first.

Command-line test case

C:\Temp>type repro.cpp

#include <Windows.h>
#include <cstdio>   // printf
#include <cstdlib>  // rand
#include <vector>
#include <algorithm>

void calculate() {
    int t = ::GetTickCount();
    float a = 1.0f, b = 1.0f, m = 0.0f;
    for (int i = 0; i < 30000000; i++) {
        m = a * b;
        m -= a * 0.1f;
        m += b * 0.2f;
    }
    printf("time cost %d\n", ::GetTickCount() - t);
}

int main() {
    std::vector<float> v1;
    v1.resize(10);
    for (auto it = v1.begin(); it != v1.end(); it++)
        *it = (float)rand();
    std::vector<float> tmp(10);

    std::reverse_copy(v1.end() - 7, v1.end(), tmp.begin());
    calculate();

    std::reverse_copy(v1.end() - 8, v1.end(), tmp.begin());  // AVX code path is selected.
    calculate();
}

C:\Temp>cl /EHsc /W4 /WX .\repro.cpp
Microsoft (R) C/C++ Optimizing Compiler Version 19.34.31937 for x86
Copyright (C) Microsoft Corporation.  All rights reserved.

repro.cpp
Microsoft (R) Incremental Linker Version 14.34.31937.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:repro.exe
repro.obj

C:\Temp>.\repro.exe
time cost 172
time cost 296
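The gap between the two timings is far larger than the resolution of GetTickCount(), which is typically 10 to 16 ms, but a finer clock helps if the loop is shortened while investigating. A minimal sketch using `std::chrono::steady_clock` follows; `calculate_ms` is a hypothetical variant of calculate() that returns the elapsed time instead of printing it, and is not part of the original repro.

```cpp
#include <chrono>

// Hypothetical variant of calculate(): runs the same arithmetic loop and
// returns the elapsed wall-clock time in milliseconds.
inline double calculate_ms(int iterations) {
    const auto start = std::chrono::steady_clock::now();
    float a = 1.0f, b = 1.0f;
    // Storing to a volatile keeps the compiler from deleting the loop; it does
    // not stop constant folding of the arithmetic in optimized builds.
    volatile float sink = 0.0f;
    for (int i = 0; i < iterations; i++) {
        float m = a * b;
        m -= a * 0.1f;
        m += b * 0.2f;
        sink = m;
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

In the repro, each calculate() call could then be replaced by something like `printf("time cost %.3f ms\n", calculate_ms(30000000));`.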

Labels: fixed (Something works now, yay!), performance (Must go faster)
