Closed
Describe the bug
Some STL algorithms automatically take their AVX code path when the CPU supports it, but they return without calling _mm256_zeroupper(). On some CPUs (e.g. Skylake), this slows down SSE code that runs afterwards.
See also: "Why is this SSE code 6 times slower without VZEROUPPER on Skylake?"
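To illustrate the pattern being asked for, here is a minimal sketch of a hand-written AVX routine that reverses 8 floats and executes VZEROUPPER before returning. It is not the STL's implementation. The names `reverse8`, `reverse8_avx`, `cpu_has_avx` and `AVX_TARGET` are hypothetical, and the sketch assumes an x86/x64 target built with MSVC, GCC or Clang.

```cpp
#include <immintrin.h>

#if defined(_MSC_VER)
#include <intrin.h>
#define AVX_TARGET  // MSVC accepts AVX intrinsics without /arch:AVX
#else
#define AVX_TARGET __attribute__((target("avx")))  // GCC/Clang: enable AVX for one function
#endif

// Hypothetical helper: true if both the CPU and the OS support AVX.
static bool cpu_has_avx() {
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, 1);
    const bool osxsave = (regs[2] & (1 << 27)) != 0;
    const bool avx = (regs[2] & (1 << 28)) != 0;
    // XCR0 bits 1 and 2: the OS saves XMM and YMM state.
    return osxsave && avx && ((_xgetbv(0) & 0x6) == 0x6);
#else
    return __builtin_cpu_supports("avx") != 0;
#endif
}

// Hypothetical AVX kernel: writes src[7..0] to dst[0..7].
AVX_TARGET static void reverse8_avx(const float* src, float* dst) {
    __m256 v = _mm256_loadu_ps(src);
    // Reverse the four floats inside each 128-bit lane.
    v = _mm256_permute_ps(v, 0x1B);
    // Swap the two 128-bit lanes to complete the reversal.
    v = _mm256_permute2f128_ps(v, v, 0x01);
    _mm256_storeu_ps(dst, v);
    // Clear the upper halves of the YMM registers so that SSE code
    // running after this function does not pay a transition penalty.
    _mm256_zeroupper();
}

// Dispatcher: AVX path when available, scalar fallback otherwise.
void reverse8(const float* src, float* dst) {
    if (cpu_has_avx()) {
        reverse8_avx(src, dst);
    } else {
        for (int i = 0; i < 8; ++i) {
            dst[i] = src[7 - i];
        }
    }
}
```

Calling `reverse8(src, dst)` on two 8-element float arrays leaves `dst` holding `src` in reverse order. The only line that matters for this report is the `_mm256_zeroupper()` call at the end of the AVX path.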
In the following code, the second call to calculate() is much slower than the first.
Command-line test case
C:\Temp>type repro.cpp
#include <Windows.h>
#include <vector>
#include <algorithm>

void calculate() {
    int t = ::GetTickCount();
    float a = 1.0f, b = 1.0f, m = 0.0f;
    for (int i = 0; i < 30000000; i++) {
        m = a * b;
        m -= a * 0.1f;
        m += b * 0.2f;
    }
    printf("time cost %d\n", ::GetTickCount() - t);
}

int main() {
    std::vector<float> v1;
    v1.resize(10);
    for (auto it = v1.begin(); it != v1.end(); it++)
        *it = (float)rand();
    std::vector<float> tmp(10);
    std::reverse_copy(v1.end() - 7, v1.end(), tmp.begin());
    calculate();
    std::reverse_copy(v1.end() - 8, v1.end(), tmp.begin()); // AVX code path is selected.
    calculate();
}
C:\Temp>cl /EHsc /W4 /WX .\repro.cpp
Microsoft (R) C/C++ Optimizing Compiler Version 19.34.31937 for x86
Copyright (C) Microsoft Corporation. All rights reserved.
repro.cpp
Microsoft (R) Incremental Linker Version 14.34.31937.0
Copyright (C) Microsoft Corporation. All rights reserved.
/out:repro.exe
repro.obj
C:\Temp>.\repro.exe
time cost 172
time cost 296